Overlap_Data_Transfers_in_CUDA

How to Overlap Data Transfers in CUDA C/C++

2020 Jan 27th _**CookieLau**_

[toc]

Streams

Cuda 的数据并行需要依靠 Stream 流来实现。默认情况下，Cuda 的 Device 活动都会被分配在默认流中，而在同一流中的操作只能顺序执行，失去了并行的效率。 Cuda 可以通过 cudaStreamCreate() 和 cudaStreamDestroy() 实现 非默认流的创建和销毁。

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1)
// do something
result = cudaStreamDestroy(stream1)

The default stream

默认流自带同步性质，任何一个在默认流中执行的 kernel 核函数，都隐式地在 kernel 调用前后加上了一句 cudaDeviceSynchronize() ，即只有当所有其他的流中都执行结束才能开始默认流中的工作，只有当默认流中的工作做完才能开始其他流中的工作，这对并行来说非常不利。

有些函数的调用会阻塞 host，有些则会阻塞 device，视具体的函数而定，如：

cudaMemcpy Host2Device 会阻塞 host
kernel<<<>>> 会阻塞 device

当 host 被阻塞时，host 无法调用 kernel 所以相当于将 host 和 device 一同阻塞了。

但是对于只阻塞 device 的调用而言，如自定义核函数的调用，此时没有对 host 加以限制，所以一定程度上是可以使得 device 和 host 并行：

01. cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
02. kernel<<<1,N>>>(d_a); // work on device, do not block host
03. myCpuFunction(b); // work on host
04. cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

对于 host 而言，在 01 H2D 被阻塞了，等待数据传输完成之后调用 02 kernel，调用之后立刻返还控制权，此时 device 开始工作，与此同时，host 也执行 03 myCpuFunction(b) 然后同 device 一起进入 04 D2H

对于 device 而言，在 01 H2D，在 02 kernel 执行核函数，然后再 04 D2H 返还数据。对于 03 myCpuFunction 完全不知情。

Non-default streams

在保证数据独立性的情况下，我们可以通过在创建非默认流来专门用于数据的传输，如：

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1)
result = cudaStreamDestroy(stream1)
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1)

对于像 cudaMemAsync 这种带有 Async 后缀的函数而言，其都是非阻塞 non-blocking 的，所以执行之后 host 又立刻获得控制权，继续向下运行。

对于自定义核函数指定流的方式，就是在 <<<>>> 的第四个参数写上流的 StreamId 即可，0 为默认流

increment<<<1,N,0,stream1>>>(d_a)

P.S. 其中第三个参数是共享内存的参数，在这里不涉及。

Synchronization with streams

对于显示的同步有如下几种方法：
cudaDeviceSynchronize()：最粗粒度的同步，将所有的 host 和 device 都同步在一起，对效率影响最大
cudaStreamSynchronize(stream)：对某个流的同步，只会对 host thread 进行阻塞，对于其他的 device 上或同一 device 上的其他流不造成影响，兄弟函数cudaStreamQuery(stream) 用于查询 stream 流中的指令是否已经运行完成
cudaEventSynchronize(event)：对某个记录的事件的同步，也有兄弟函数 cudaEventQuery(event)
cudaStreamWaitEvent(stream, event)：让 stream 等待 event 事件的结束，其中 event 可以不是 stream 上的事件，可以是其他流上的，甚至是其他 device 上的事件。

Overlapping Kernel Execution and Data Transfers

要想实现并行所需要具备的三个基本条件：
设备的 compute capability 大于等于 1.1，才具备数据的复制和指令的执行并行的能力；
核函数的运行和数据的传输必须在 不同的、非默认流上，否则必会导致串行的出现；
数据传输中关于 host 的部分内存必须是 pinned，即可以知道位置的，不能是未知的。

Example: 对数据分块执行 kernel 核函数

// 循环执行 H2D->kernel->D2H
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}

// 顺序执行 H2D->kernel->D2H
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], 
                  streamBytes, cudaMemcpyHostToDevice, cudaMemcpyHostToDevice, stream[i]);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset], 
                  streamBytes, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToHost, stream[i]);
}

上述两种并行优化和串行相比：

在Tesla C1060上的表现：

Time for sequential transfer and execute (ms ): 12.92381 max error : 2.3841858E -07 Time for asynchronous V1 transfer and execute (ms ): 13.63690 max error : 2.3841858E -07 Time for asynchronous V2 transfer and execute (ms ): 8.84588 max error : 2.3841858E -07

第一种优化几乎没有变化，第二种优化是串行时候的 2/3

C1060 只有两个 engine 分别负责数据传输和核函数执行，所以分别是上图这样。

在Tesla C2050上的表现：

Time for sequential transfer and execute (ms ): 9.984512 max error : 1.1920929e -07 Time for asynchronous V1 transfer and execute (ms ): 5.735584 max error : 1.1920929e -07 Time for asynchronous V2 transfer and execute (ms ): 7.597984 max error : 1.1920929e -07

第一种优化是串行时候的 1/2，第二种优化是串行时候的 2/3

C2050 有三个 engine，分别负责 H2D 数据传输，核函数执行和 D2H 数据传输。在 C2050 的 version2 中，D2H 讲道理不应该被 Kernel 阻塞，但是出现了这样的情况，原因是当前后接连执行在不同的流上面的核函数时，GPU想要尽可能地使得计算能够并行化，所以将启动 D2H 的信号延后到了所有的 kernel function 执行结束。原话如下：

When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent execution of these kernels and as a result delays a signal that normally occurs after each kernel completion (which is responsible for kicking off the device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers.

好消息是 compute capability 在 3.5 以上的设备具有 Hyper-Q 特性，（貌似是可以自动优化执行的顺序不需要人工去调整），所以上面的两种并行优化方法所得到的结果都会一样的： 在Tesla K20c上的表现：

Time for sequential transfer and execute (ms): 7.101760 max error : 1.1920929e -07 Time for asynchronous V1 transfer and execute (ms): 3.974144 max error : 1.1920929e -07 Time for asynchronous V2 transfer and execute (ms): 3.967616 max error : 1.1920929e -07

可以看出都是串行所需时间的一半左右。

PreviousOptimize_Data_Transfers_in_CUDA NextWrite_Flexible_Kernels_with_Grid-Stride_Loops

Last updated 5 years ago

Was this helpful?