Performance Evaluation of Data Migration Methods between the Host and the Device in CUDA-based Programming

Rafael Silva Santos, Danilo Medeiros Eler, and Rogério Eduardo Garcia

Faculdade de Ciências e Tecnologia, UNESP - Univ Estadual Paulista, Presidente Prudente, SP - Brazil
rafael.silva.sts@gmail.com, {daniloeler, rogerio}@fct.unesp.br

Abstract. The CUDA programming model is heterogeneous: it is composed of two components, the host (CPU) and the device (GPU), each with its own memory space and processing units. A great challenge in increasing the performance of GPU-based applications is the data migration between these memory spaces. Currently, the CUDA platform supports the following data migration methods: UMA, zero-copy, pageable, and pinned memory. In this paper, we compare the performance of the zero-copy method with the other methods by considering the overall application runtime. Additionally, we investigate aspects of the data migration process to explain the causes of the performance variations. The obtained results demonstrate that, in some cases, zero-copy memory can deliver an average performance about 19% higher than pinned memory transfers, which were the second most efficient in the studied situation. Finally, we present limitations of zero-copy memory as a resource for improving the performance of CUDA applications.
1 Introduction

By considering the involved architectures, the CUDA programming model is heterogeneous: it is composed of two components, the host (CPU) and the device (GPU) [5]. Both components have separate memory spaces and processing units. A great challenge for increasing performance in CUDA-based programming is the data migration between the host and device memory spaces [7]. Managing this data transfer is not a simple task, even without focusing on performance, since some methods require explicit requests for memory copies and the control of concurrent data access [8]. At present, the latest CUDA version provides four main methods for data migration between the host and device memory spaces: zero-copy memory, the UMA (Unified Memory Access) model, pageable memory, and pinned memory [11].

In this paper, we investigate the zero-copy migration method by comparing its performance with that of the other three migration methods provided by the CUDA API and by evaluating the overall runtime of all tests. We conducted a case study of the migration process. First, we developed an application that performs a scalar multiplication and the sum of the elements of a vector. Then, we adapted this application to execute and evaluate two situations: in the first, the GPU performs fewer memory accesses during thread execution; in the second, a greater number of access transactions is performed. Finally, we present the limitations of using the zero-copy memory method as an approach to increase the performance of a CUDA application.

Besides this introduction, the paper consists of five further sections. Related work is introduced in Sect. 2. In Sect. 3, we describe the main data migration methods between host and device in the CUDA architecture. In Sect. 4, we present the results, evaluating and discussing the performance of the methods described in Sect. 3 and comparing them with the zero-copy memory method. In Sect. 5, we present some limitations of the use of zero-copy memory. Finally, Sect. 6 concludes this paper.

2 Related Works

In recent years, several studies on data transfer between the components of the GPU-based programming model have been produced. Kaldewey et al. [4] studied the efficiency of PCI-E bus bandwidth usage in the communication between the host and the device under the UVA model. Using zero-copy memory, they demonstrated that performance close to the theoretical maximum memory bandwidth can be achieved. However, that study did not compare the impact of this memory on global application performance against the other transfer methods. Bai et al. [1] used zero-copy memory to optimize the performance of algorithms for lattice-based cryptographic systems. In their tests, they analyzed both a single GPU and multiple GPUs communicating with the CPU. Landaverde et al. [7] investigated the performance of the UMA model relative to pageable memory transfers. In their methodology, they established a benchmark model similar to the one used in this work. Based on the results, they demonstrated that the use of the UMA model causes a performance loss with respect to pageable and pinned memory.

3 Data Migration between Host and Device

3.1 CUDA Main Memory Access Model

Compute Unified Device Architecture (CUDA) is a platform developed by NVIDIA to allow the use of GPUs in general-purpose computing. The CUDA programming model is heterogeneous, and each component, host and device, owns its own processing unit and memory space [11].
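As an aside, the separation between host and device memory spaces can be inspected at runtime through the CUDA runtime API. The short sketch below is illustrative only; it is not part of the benchmark described later, and the device index and the printed fields are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;  // assuming a single-GPU system
    cudaSetDevice(dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);  // device memory, separate from host RAM

    printf("Device %d: %s\n", dev, prop.name);
    printf("Device global memory: %zu MB total, %zu MB free\n",
           totalBytes >> 20, freeBytes >> 20);
    printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
    return 0;
}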
The rate at which a processing unit can access main memory is limited by its own memory bandwidth. In the CUDA programming model, in addition to the memory bandwidth of the GPU and of the CPU, there is a bus between the host and the device, as shown in Fig. 1. Typically, the bus that mediates this communication is the PCI (Peripheral Component Interconnect) bus. The PCI memory bandwidth is generally lower than that of the CPU and the GPU [10]. Thus, it is important to analyze the impact of memory transfers when designing a CUDA application.

Fig. 1: Heterogeneous CUDA programming model.

3.2 CUDA Data Transfer Methods

Over its versions, NVIDIA has introduced different memory management methods on the CUDA platform. At the time this study was conducted, the latest version of the CUDA SDK – CUDA 7 – offered the following methods: pageable memory, pinned memory, zero-copy memory, and the Unified Memory Access (UMA) model [11].

Standard Transfer Method: The standard transfer method can be executed with pageable memory or pinned memory. With pageable memory, the physical address of the data may change during the execution of an application, since memory pages may be swapped out to secondary memory. Therefore, before the data is copied from the host to the device, the runtime first migrates the desired data portion to a pinned memory buffer on the host [11]. Then, the data is transferred from this buffer to the device through Direct Memory Access (DMA) over the PCI bus [3]. In contrast to pageable memory, pinned memory is never swapped out. Thus, in the standard transfer method with pinned memory, no migration of the data to an intermediate buffer occurs. This allows DMA over the PCI bus to transfer the data directly from its current physical location in host memory to the device memory space [10]. With pageable memory, the additional transfer decreases the effective memory bandwidth, and therefore the performance, with respect to pinned memory [2].

Zero-copy Method: Zero-copy memory is a "kind" of pinned memory provided by Unified Virtual Addressing (UVA). The zero-copy method removes the need for explicit requests (i.e., function calls) for data migration. The UVA model provides a single virtual memory space shared between the host and the device, so that the address of the data on the host is mapped into the device address space [9]. Data migration occurs implicitly whenever a portion of data that is not present in the current context is referenced, either by the host or by the device [10]. Thus, zero-copy memory is typically used when a data set does not fit completely in device memory, since the CUDA API keeps on the device only the portion of data that is accessed at a specific time [6]; the remaining data set is kept in host memory [10]. When multiple GPUs are used, the UVA model also provides a shared memory space and eliminates the intermediate copy through host memory during data transfers from one GPU to another. To the best of our knowledge, the internal aspects of data movement in the zero-copy memory provided by the CUDA API are not disclosed.

UMA Model: The Unified Memory Access (UMA) model is similar to zero-copy memory in that no explicit request for data migration is needed. On the other hand, the data is not allocated in pinned memory. As in zero-copy memory, the API is responsible for managing the entire data migration process.

4 Case Study: Evaluation of Memory Management Methods

In this section, we evaluate the performance of data migration between the host and device memory spaces.
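For reference, the following sketch summarizes how each of the four methods described in Sect. 3.2 is typically set up with the CUDA runtime API. It is an illustrative sketch rather than the exact benchmark code: the buffer size and variable names are assumptions, and error checking is omitted for brevity.

#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative setup of the four data migration methods (error checking omitted).
void allocation_examples() {
    const size_t N = 1 << 20;           // illustrative size
    const size_t bytes = N * sizeof(int);
    int *h_vec = NULL, *d_vec = NULL;

    // Required for mapped (zero-copy) memory; must be set before the device is first used.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // (1) Pageable memory: plain host allocation plus explicit copies.
    h_vec = (int *)malloc(bytes);
    cudaMalloc((void **)&d_vec, bytes);
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);
    cudaFree(d_vec); free(h_vec);

    // (2) Pinned memory: page-locked host allocation plus explicit copies (DMA-friendly).
    cudaMallocHost((void **)&h_vec, bytes);
    cudaMalloc((void **)&d_vec, bytes);
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);
    cudaFree(d_vec); cudaFreeHost(h_vec);

    // (3) Zero-copy memory: mapped pinned allocation, no explicit copies;
    //     the kernel dereferences a device pointer that maps to host memory.
    cudaHostAlloc((void **)&h_vec, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_vec, h_vec, 0);
    cudaFreeHost(h_vec);

    // (4) UMA (managed) memory: a single pointer valid on host and device,
    //     migrated on demand by the CUDA runtime.
    cudaMallocManaged((void **)&d_vec, bytes);
    cudaFree(d_vec);
}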
In addition to the performance comparison, we investigated the aspects that explain the variation among the migration methods. In the tests, we set up two different situations: in the first, described in Subsect. 4.1, we created a simulated application in which the GPU performs a small number of memory access transactions for each kernel thread; in the second, we adapted this application so that the GPU performs a greater number of memory access transactions, as shown in Subsect. 4.2.

4.1 Situation 1: Smaller Number of Memory Access Transactions During Thread Execution

Application Model. For the tests, we developed an application that performs a scalar multiplication and the sum of the elements of a vector. The implemented algorithm can be divided into the following steps:

1. Initialization of the vector elements
2. Data transfer to the device (HtoD)
3. Vector scalar multiplication (kernel)
4. Data transfer to the host (DtoH)
5. Sum of the vector elements

The step that comprises the vector scalar multiplication was parallelized and runs on the GPU. The code snippet executed in this step is shown in Fig. 2. The host is responsible for the initialization and for the sum of the vector elements; these steps are performed sequentially. Throughout the algorithm, all steps use the vector as the input and output data set. After the initialization of the vector, this data set must be migrated from the host to the device (HtoD). Then, after the multiplication by a scalar value, the vector is migrated back to the host (DtoH). The CUDA application context comprises a single stream. A pseudo-code containing all algorithm steps is presented in Algorithm 1; the code snippet with the parallelized instructions was previously shown in Fig. 2.

__global__ void mult(int *vect, int num, int N) {
    // Thread index
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Multiplication = vector element * constant value
    if (id < N)
        *(vect + id) = *(vect + id) * num;
}

Fig. 2: Code snippet that runs on the device for vector multiplication by a constant.

To compare and investigate the performance of each memory management method, we adapted the application with the necessary function calls to allocate and copy memory. It is worth mentioning that we focus exclusively on evaluating the performance differences and on investigating aspects of the data migration between the host and the device. Thus, the implemented algorithm is not complex and does not require a large computational effort.

Algorithm 1 Scalar multiplication and sum of vector elements
Require: n > 0
Ensure: vec ← vec * c
  vec ← allocation
  c ← k1 {k1 is a constant}
  for i = 1 to n do
    vec[i] ← k2 {k2 is a constant}
  end for
  ⇒ data migration – HtoD
  mult<<<blocks, threads>>>(vec, c, n) {CUDA kernel}
  ⇒ data migration – DtoH
  sum ← 0
  for i = 1 to n do
    sum ← sum + vec[i]
  end for

Methodology. For our microbenchmarks, we used the application that performs the scalar multiplication and the sum of the vector elements (see Subsect. 4.1). The NVIDIA Visual Profiler and nvprof tools were used to measure processing time and to obtain details of the data transfer throughout the runtime. We ran the application with four different configurations, one for each data migration method provided by the CUDA platform. In the tests, we used five different vector sizes; the arrangement of threads and thread blocks for the kernel launch was derived from these sizes, as sketched below.
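As an illustration only (not the authors' exact code; the variable names follow Fig. 2, Algorithm 1, and the earlier allocation sketch), the launch configuration can be derived from the vector size N assuming 1024 threads per block, matching the configurations listed next.

// Illustrative launch configuration: 1024 threads per block,
// with enough blocks to cover all N elements.
void launch_mult(int *d_vec, int c, int N) {
    const int threadsPerBlock = 1024;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // e.g. 1024 blocks for N = 1048576
    mult<<<blocks, threadsPerBlock>>>(d_vec, c, N);
    cudaDeviceSynchronize();  // wait for the kernel before the DtoH range begins
}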
The numbers of elements for the five vector sizes were: 1048576 (1024 threads x 1024 blocks), 2097152 (1024 threads x 2048 blocks), 4194304 (1024 threads x 4096 blocks), 8388608 (1024 threads x 8192 blocks), and 16777216 (1024 threads x 16384 blocks).

For each test, the runtime timeline was divided into five ranges corresponding to the steps of the implemented algorithm. To define each time range, we used NVTX (NVIDIA Tools Extension). We named these ranges as follows: initialization, HtoD transfer, multiplication, DtoH transfer, and sum. Each range is easily associated with the algorithm steps described in Subsect. 4.1. The data transfer ranges were not considered for the UMA model and for zero-copy memory, since there is no explicit memory copy in these methods.

All results represent the average of five executions. All tests were performed on a computer running the Windows 8.1 64-bit operating system, with an Intel i5-2320 CPU, 4 GB of RAM, a PCI Express x16 2.0 bus, and an NVIDIA GeForce GT 740 GPU with 1 GB of DDR3 RAM and 384 CUDA cores. The application was implemented using the CUDA 7 SDK.

Experimental Results. In this section, we present the results of the tests. In order to provide a better understanding, all values were normalized.

Fig. 3: Global normalized runtime for the application of scalar multiplication and addition of vector elements (Situation 1).

Figure 3 shows the normalized runtimes for the application that performs the scalar multiplication and the sum of the vector elements. The results correspond to the application configured with the four memory management methods: pageable memory (M1), pinned memory (M2), UMA (M3), and zero-copy memory (M4). In addition to the global runtime, the figure also shows the runtime of each range into which the application was divided: initialization, HtoD transfer, multiplication, DtoH transfer, and sum.

From Fig. 3, we can see a distinct variation in application performance among the data migration methods. Additionally, the results are consistent: the performance variation pattern is retained as the vector size increases. From the obtained results, we ordered the data migration methods from best to worst performance: zero-copy memory, pinned memory, pageable memory, and the UMA model. In all presented tests, this order of efficiency is the same. On average, the application configured with the standard transfer method with pinned memory spends 19.40% more time than with zero-copy memory; with pageable memory, 37.24% more time was spent; and with the UMA model, 256.57% more.

To understand the reasons for the performance variations among the methods, we evaluated aspects of the data migration during the runtime of the tests. In the tests, the configuration with pageable memory spends on average 62.47% more time than pinned memory in the HtoD transfer range and 77.40% more in the DtoH transfer range. Through the NVIDIA Visual Profiler, it is possible to measure the throughput of pinned memory – on average 6.68 GB/s (HtoD) and 6.697 GB/s (DtoH) – whereas for pageable memory it was 3.69 GB/s (HtoD) and 3.87 GB/s (DtoH). In the other analyzed ranges, the time consumed by both methods is essentially the same: on average, the difference is 0.59% in the initialization, 1.52% in the multiplication, and 2.43% in the sum of the vector elements.
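The pinned-versus-pageable throughput gap can also be observed with simple event-based timing around the copy. The sketch below is illustrative only (the paper's measurements were taken with nvprof and the Visual Profiler); h_vec, d_vec, and bytes are assumed from the earlier allocation sketch.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative CUDA-event timing of a host-to-device copy
// (not the instrumentation used in the paper).
void time_htod_copy(int *d_vec, const int *h_vec, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);  // h_vec may be pageable or pinned
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbPerSec = (bytes / (ms / 1000.0)) / 1.0e9;        // effective HtoD throughput
    printf("HtoD copy: %.3f ms, %.2f GB/s\n", ms, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}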
In the standard transfer method, the data migration does not affect the performance of the other ranges into which the test timeline was divided. Thus, we used this method as a reference to compare and investigate the data migration aspects and performance of the UMA model and of zero-copy memory.

In the UMA model, there is no explicit memory copy. However, Fig. 3 shows a clear discrepancy between the processing time of this method and that of the standard transfer method when we analyze the initialization, multiplication, and sum ranges. On average, the UMA model spent 152% more time than the average of the pageable and pinned configurations during initialization, 670% more during multiplication, and 251% more during the sum range.

Fig. 4: Multiplication range runtime for the application configured with the memory management methods: zero-copy memory, pageable memory, and pinned memory.

In all tests, zero-copy memory is the most efficient. As in the UMA model, there is no explicit memory copy; however, the data migration takes place at different moments. Figure 4 shows the time consumed during the execution of the multiplication range for zero-copy memory and for the standard transfer methods in all tests. As we can see, once the vector has more than 2097152 (2048x1024) elements, the kernel runtime with the zero-copy method becomes greater than the time taken by the standard transfer method with either pageable or pinned memory.

The NVIDIA Visual Profiler does not support a graphical analysis of the data migration with zero-copy memory. However, it is possible to collect information about read and write transactions to system memory, i.e., host memory, while the kernel is running. In all executed tests, regardless of the vector size, there are on average 2097152 (2048x1024) access transactions to host memory (reads and writes) for the kernel running with zero-copy memory. Each access transaction is 32 bits wide, and the transfer rate is on average 5.88 GB/s.

Discussion. The results demonstrate that the data migration between host and device memory spaces effectively impacts application performance. Additionally, the different memory management methods provided by CUDA exhibit high performance variation. The zero-copy memory method was more efficient than the other methods; however, we cannot assert that this method will be more efficient for every application. The results demonstrate that the use of zero-copy memory affects the performance of the kernels, since access transactions to host memory occur during execution. Thus, to further investigate the behavior of this method, we modified the kernel function and ran the tests again. The modified kernel and the obtained results are shown in Subsect. 4.2.

Pinned memory obtained the second best performance. With respect to pageable memory, the performance difference is caused by the memory throughput: pinned memory has a transfer rate about 77% higher than pageable memory. The performance of the UMA model was lower than that of all other methods.

4.2 Situation 2: Greater Number of Memory Access Transactions During Thread Execution

In order to better investigate the behavior of zero-copy memory and of the other memory management methods, we adapted the original kernel function shown in Fig. 2. The modified code snippet can be seen in Fig. 5. The kernel modification does not change the application results; it adds 49 redundant instructions to each thread execution.
By using this kernel, we intend to simulate an increased number of memory access transactions during kernel execution. Note that the remaining steps of the algorithm (Algorithm 1) were not modified; thus, the results produced by the application are the same.

Methodology. To perform the tests, we used the same methodology applied to the original application with the unmodified kernel (see Subsect. 4.1).

__global__ void mult(int *vect, int num, int N) {
    // Thread index
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Multiplication = vector element * constant value
    if (id < N)
        for (int i = 0; i < 50; i++)
            *(vect + id) = *(vect + id) * num;
}

Fig. 5: Code snippet of the modified kernel.

Experimental Results. Figure 6 shows the normalized runtime for the application in Situation 2. In comparison with Fig. 3, which represents the application execution time in Situation 1, we can observe that the relative time consumed by the multiplication range increased in all tests, whereas in all other observed ranges (initialization, HtoD transfer, DtoH transfer, and sum) the runtime remained the same. In all tests, zero-copy memory has the lowest performance, whereas the standard transfer method with pinned memory is the most efficient.

Fig. 6: Global normalized runtime for the application with the adapted kernel (Situation 2).

The ratio between the multiplication range runtime with the modified kernel and with the unmodified kernel is shown in Fig. 7. From this figure, we can see that the runtime with the standard transfer methods increases by a factor of about 20, whereas the runtime increase with zero-copy memory is at least a factor of 40. Through the NVIDIA Visual Profiler, we can observe that, with zero-copy memory, the kernel runtime increase is accompanied by an increase in the number of host memory access transactions. In all tests, the number of access transactions increased 50 times, which corresponds to the increased number of instructions in the modified kernel.

Fig. 7: Runtime ratio of the multiplication range: ratio between the multiplication runtime in Situation 2 and in Situation 1.

Discussion. The tests demonstrated that an increase in data accesses during kernel execution affects the memory migration between the host and the device when the zero-copy memory method is used. Additionally, we can observe that the CUDA API does not perform any self-optimization when zero-copy memory is used, since the kernel modification, which only increases the number of memory access transactions by executing redundant instructions, leads to a proportional increase in host memory traffic.

5 Limitations of Zero-copy Memory Usage

The first limitation on the use of zero-copy memory resides in the fact that the data is allocated in pinned memory. The allocation of large amounts of pinned memory can degrade operating system performance [10]. Unfortunately, it is not possible to establish a precise relationship between the amount of memory allocated and the total memory installed in the system: besides the amount of available memory, the operating system and the other applications running in the environment also influence the performance loss of the whole system. Thus, a good practice is not to use zero-copy memory when the maximum amount of data to be allocated is not known in advance.

As discussed in Subsect. 4.2, another problem in the use of zero-copy memory is the performance loss that occurs when the number of memory access transactions increases. In particular, part of these transactions may correspond to redundant accesses performed inside a kernel. In some cases, such redundant accesses can be avoided by using a local variable inside the kernel function that receives a copy of the data, as sketched below.
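A minimal sketch of this workaround, adapting the modified kernel of Fig. 5, is shown next. This is an illustration rather than code from the paper; the kernel name is hypothetical.

// Illustrative rewrite of the Fig. 5 kernel: the element is read from the
// mapped (zero-copy) buffer once, updated in a register, and written back once,
// instead of performing 50 read-modify-write accesses to host memory.
__global__ void mult_local(int *vect, int num, int N) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < N) {
        int value = vect[id];        // single read over the PCI bus
        for (int i = 0; i < 50; i++)
            value = value * num;     // redundant work stays in a register
        vect[id] = value;            // single write back
    }
}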
Basically, zero-copy memory should be used for data that undergoes a small number of access transactions during kernel execution; when a larger number of access transactions is expected, other transfer methods are recommended.

6 Conclusion

Zero-copy memory presents implicit and transparent memory copies to the programmer, hiding the complexity of managing the data migration. Originally, this method was conceived as a feature to allow the use of data sets that cannot be entirely stored in device memory. This study showed that, in cases in which the kernel function performs a small number of memory access transactions (in particular, a single access transaction), zero-copy memory can be used to provide a performance increase. In certain cases, the use of zero-copy memory can provide a gain of more than 19% in application runtime compared with pinned memory. Based on the obtained results, we also demonstrated that a large number of memory access transactions during kernel execution reduces the overall performance of the application, establishing a barrier to the use of zero-copy memory.

In the tested situations, we did not use multiple streams, and the memory copy process was not overlapped with kernel execution. Therefore, future work may include an analysis of the performance of zero-copy memory in a concurrent-streams scenario. In another future study, we will investigate the performance of zero-copy memory on other GPU models and also with the OpenCL API.

7 Acknowledgment

This work was partially supported by FAPESP (São Paulo Research Foundation), grant 2015/00622-7.

References

1. Bai, T., Davis, S., Li, J., Jiang, H.: Analysis and acceleration of NTRU lattice-based cryptographic system. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on, pp. 1–6 (June 2014)
2. Fatica, M.: Accelerating Linpack with CUDA on heterogenous clusters. In: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 46–51. GPGPU-2, ACM, New York, NY, USA (2009)
3. Hennessy, J.L., Patterson, D.A.: Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edn. (2011)
4. Kaldewey, T., Lohman, G., Mueller, R., Volk, P.: GPU join processing revisited. In: Proceedings of the Eighth International Workshop on Data Management on New Hardware, pp. 55–62. DaMoN '12, ACM, New York, NY, USA (2012)
5. Kim, Y., Shrivastava, A.: Memory performance estimation of CUDA programs. ACM Trans. Embed. Comput. Syst. 13(2), 21:1–21:22 (Sep 2013)
6. Kirk, D.B., Hwu, W.m.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edn. (2010)
7. Landaverde, R., Zhang, T., Coskun, A., Herbordt, M.: An investigation of unified memory access performance in CUDA. In: High Performance Extreme Computing Conference (HPEC), 2014 IEEE, pp. 1–6 (Sept 2014)
8. Li, W., Jin, G., Cui, X., See, S.: An evaluation of unified memory technology on NVIDIA GPUs. In: Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pp. 1092–1098 (May 2015)
9. Tang, K., Yu, Y., Wang, Y., Zhou, Y., Guo, H.: EMA: Turning multiple address spaces transparent to CUDA programming.
In: ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh, pp. 170–175 (Sept 2012)
10. NVIDIA Corporation: CUDA C Best Practices Guide (Mar 2015)
11. NVIDIA Corporation: CUDA C Programming Guide (Mar 2015)