CUDA_7_Streams_Simplify_Concurrency

CUDA 7 Streams Simplify Concurrency

2020 Jan 27th CookieLau

Source: https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

[toc]

Preface

在 Cuda 7 之前，所有的没有指定 Stream流的 Device 核函数调用都是在默认流中执行，导致串行 Serialization 。默认流是一种特殊的流，他会 隐式地 同步 Device 上的其他所有流。在 Cuda 7 开始，可以让每个 host 上的线程都拥有自己的一个独立的 Stream。 (why for host thread?)

Asynchronous Commands in CUDA

异步指令是指 host 在执行指令之后，立刻重新获得控制权，无需等被调用的异步指令执行完成，有以下：

Kernel launches; 核函数调用
Memory copies between two addresses to the same device memory; 内存从两个不同的地址拷贝到同一个设备的内存中
Memory copies from host to device of a memory block of 64 KB or less; 从host到device中的内存拷贝
Memory copies performed by functions with the Async suffix; 执行带有 Async 后缀的内存拷贝函数的调用
Memory set function calls. ( 分配内存？)

默认流的编号是 0，所以调用核函数时，不指定特定流和指令到0号流都是将核函数的执行分配到默认流上，如：

  kernel<<< blocks, threads, bytes >>>();    // default stream
  kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0

以上两条都是在默认流中执行核函数的示例。

The Default Stream

Cuda 7 之前，每个设备有且仅有一个 default Stream，提供给所有的 host thread 使用，这就造成了 隐式同步。 隐式同步 是指：来自于不同的流的两条指令会因为中间有 host thread 在他们之间执行了 在默认流上运行的核函数 而无法并行化

Cuda 7 提出了一个叫做 per-thread default stream 的新特性解决了这一问题。其有两个性质： 1. 每个 host thread 有自己的 default stream，这使得不同的 host thread 在 default stream 上执行的指令可以并行 (因为他们不共享 default stream)。 2. 每个 host thread 所拥有的 default stream 是一个 regular stream，即和自己用 cudaStreamCreate 创建的非默认流是同等级别的，可以实现并行。

per-thread default stream 使用方法有两种，任选其一即可： 1. (推荐使用) 不改动源程序，在编译的时候加上 --default-stream per-thread ，如：

nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread

(不推荐使用) 在导入CUDA头文件(cuda.h or cuda_runtime.h)之前 #define CUDA_API_PER_THREAD_DEFAULT_STREAM

注 ⚠️：

第二种方法在使用 nvcc 编译器 的时候是无效的，因为 nvcc 会在编译的时候自己在 cu 文件的第一行加上 #include <cuda_runtime.h>，所以无法做到在导入CUDA头文件之前 #define。

cudaCreateStreamWithFlag

除了上面的在编译时添加参数使得每个 host thread 有自己的 default stream 之外，我们还可以在创建非默认流的时候加上合适的参数使得非默认流和默认流之间不是阻塞（blocking）的关系，具体如下：

__host__
cudaError_t cudaCreateStreamWithFlags(cudaStream_t *stream, optionFlag);

@Params
optionFlag:
1. cudaStreamDefault 
    * cudaStreamCreate 的默认就是这个 default，会和默认流发生阻塞
2. cudaStreamNonBlocking
    * 创建 NonBlocking 的非阻塞流，不会和默认流发生阻塞

Example：

cudaStream_t stream1, stream2;

cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);

cudaEventRecord(startEvent);
addArraysInto_1<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(c, a, b, N);
addArraysInto_2<<<numberOfBlocks, threadsPerBlock>>>(d, a, b, N);
addArraysInto_3<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(e, a, b, N);
cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);

cudaEventElapsedTime(&time, startEvent, stopEvent);
printf("NonBlocking time cost:%f\n",time*1e3);

测试结果是 NonBlocking 确实会快 1/3

A Multi-Stream Example

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream 调用一个在非默认流上运行的核函数
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N); 

        // launch a dummy kernel on the default stream 调用一个在默认流上运行的核函数
        kernel<<<1, 1>>>(0, 0);
    }

在普通的编译下：
```
nvcc ./stream_test.cu -o stream_legacy
```
得到的 nvvp 图像是这样的：可以看出每个 worker kernel 都被中间的很小的默认流上的 dummy kernel 给阻塞了，导致整体 serial。
在带有 option 的编译下：
```
nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread
```
得到的 nvvp 图像是这样的：可以看出 worker kernel 实现了并行，其中 stream 14 是 dummy kernel 的函数

解析：在普通的编译情况下，所有的 host 操作都共享一个 default stream，这导致了只要 default stream 中有东西，其他的非默认流就不能并行；但是在 per-thread 的情况下，每个分配到不同的默认流的 host command，即 kernel<<<1, 64, 0, streams[i]>>>(data[i], N); 都拥有了属于自己的 default stream，所以不需要管所谓的共有的 default stream了。而没有指定默认流的 kernel<<<1, 1>>>(0, 0); 相当于指定了 0号 stream，所以所有的 dummy 都被安排到 stream 14 中去了，这里的 stream 14 和其他的 stream 是同等关系，也证明了 per-thread option 的第二个性质，每个 default stream 和 non-default stream 现在是同等的关系。

上面是多个流的情况，下面这里是多线程的example，更加细粒度的看清楚什么是 per-thread 的性质。

A Multi-threading Example

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N); //lunch in default stream

    cudaStreamSynchronize(0); //manual synchronize

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating threadn");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if(pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining threadn");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}

普通的编译：

nvcc ./pthread_test.cu -o pthreads_legacy

用 nvvp 查看：可以看出由于 cudaStreamSynchronize(0) 的存在，不同的线程之间变成了一个串行的都在 default stream 中运行的情况。

per-thread 的编译：

nvcc --default-stream per-thread ./pthread_test.cu -o pthreads_per_thread

用 nvvp 查看：可以看出每个线程都有自己的 default stream 实现了并行。因为不同的线程之前不共享所以 cudaStreamSynchronize(0) 并没有对其他的 pthread 进行阻塞。

说明 per-thread 的编译选项确实是对细粒度的 pthread 进行了不同的 default stream 的分配。

More Tips

当进行并行编程的时候需要注意的地方：

Remember: With per-thread default streams, the default stream in each thread behaves the same as a regular stream, as far as synchronization and concurrency goes. This is not true with the legacy default stream.
The --default-stream option is applied per compilation unit, so make sure to apply it to all nvcc command lines that need it.
cudaDeviceSynchronize() continues to synchronize everything on the device, even with the new per-thread default stream option. If you want to only synchronize a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in our second example.
Starting in CUDA 7 you can also explicitly access the per-thread default stream using the handle cudaStreamPerThread, and you can access the legacy default stream using the handle cudaStreamLegacy. Note that cudaStreamLegacy still synchronizes implicitly with the per-thread default streams if you happen to mix them in a program.
You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreate().

PreviousMemory Classification NextOptimize_Data_Transfers_in_CUDA

Last updated 5 years ago

Was this helpful?