using mkl::blas::gemm;

int64_t n = 32;
device dev(gpu_selector()); // or host_selector()/cpu_selector()
queue Q(dev);
size_t bytes = n * n * sizeof(double);
double* A = ...;
double* B = ...;
double* C = ...;
buffer<double, 1> A_buf{A, range<1>(n * n)};
buffer<double, 1> B_buf{B, range<1>(n * n)};
buffer<double, 1> C_buf{C, range<1>(n * n)};
gemm(Q, mkl::transpose::N, mkl::transpose::N, n, n, n,
     1.0, A_buf, n, B_buf, n, 0.0, C_buf, n);
Using the USM API: the user defines how the data is shared, and is responsible for knowing when tasks have completed.
using mkl::blas::gemm;

int64_t n = 32;
device dev(gpu_selector()); // or host_selector()/cpu_selector()
queue Q(dev);
size_t bytes = n * n * sizeof(double);
double* A = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
double* B = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
double* C = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
gemm(Q, mkl::transpose::N, mkl::transpose::N, n, n, n,
     1.0, A, n, B, n, 0.0, C, n);
Q.wait_and_throw(); // wait for the task to finish writing the data
Before USM, SYCL provided the buffer method:
queue q;

// allocate data on the host
int* data = static_cast<int*>(malloc(N * sizeof(int)));

// initialize on the host
for (int i = 0; i < N; i++) {
  data[i] = 10;
}

// create a scope so the buffer is destroyed automatically
{
  buffer<int, 1> buf(data, range<1>(N));
  q.submit([&](handler& h) {
    // an accessor is needed to access the data on the device
    auto A = buf.get_access<access::mode::read_write>(h);
    h.parallel_for(range<1>(N), [=](id<1> i) { A[i] += 1; });
  });
} // buffer destruction copies the data back from device to host

// accessible on the host again after the buffer is destroyed
for (int i = 0; i < N; i++) {
  std::cout << data[i] << std::endl;
}

// free the host allocation
free(data);
With USM, the developer does not need to manage the data flow explicitly:
queue q;

// allocate data in USM
auto data = malloc_shared<int>(N, q); // specify the queue

// initialize on the host
for (int i = 0; i < N; i++) {
  data[i] = 10;
}

// accessible on the device
q.parallel_for(N, [=](auto i) { data[i] += 1; }).wait(); // an explicit wait is needed before the host touches the data again

// accessible on the host again after the wait finishes
for (int i = 0; i < N; i++) {
  std::cout << data[i] << std::endl;
}

// free the USM allocation
free(data, q); // specify the queue
USM allocation functions
sycl::malloc_device: allocates memory on the device; it is not accessible from the host, so explicit data copies are needed to read it on the host.
sycl::malloc_host: allocates memory on the host, accessible to the device via Direct Memory Access (see the sketch after this list).
sycl::malloc_shared: allocates memory that migrates between device and host automatically and implicitly. (A wait is needed to ensure that only the host or only the device accesses it at any one time.)
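A minimal sketch (N assumed) of malloc_host, which needs no explicit copies because the kernel dereferences the host allocation directly:

queue q;

// host allocation that the device can access via DMA
int* host_data = malloc_host<int>(N, q);
for (int i = 0; i < N; i++) {
  host_data[i] = 10;
}

// the kernel reads and writes the host pointer directly, no memcpy needed
q.parallel_for(N, [=](auto i) { host_data[i] += 1; }).wait();

free(host_data, q);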
Example (malloc_device, with explicit copies):
queue q;

int data[N];
for (int i = 0; i < N; i++) {
  data[i] = 10;
}

// allocate on the device
int* data_device = malloc_device<int>(N, q);

// memory copy from host to device
q.memcpy(data_device, data, sizeof(int) * N).wait();

// access the data on the device
q.parallel_for(N, [=](auto i) { data_device[i] += 1; }).wait();

// memory copy back from device to host
q.memcpy(data, data_device, sizeof(int) * N).wait();

for (int i = 0; i < N; i++) {
  std::cout << data[i] << std::endl;
}

// free the device allocation
free(data_device, q);
Data dependency
The in_order queue property serializes all tasks, avoiding data races.
Example:
queue q{property::queue::in_order()};

int* data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
  data[i] = 10;
}

// the in_order property serializes these three kernels
q.parallel_for(N, [=](auto i) { data[i] += 2; });
q.parallel_for(N, [=](auto i) { data[i] += 3; });
q.parallel_for(N, [=](auto i) { data[i] += 4; });
q.wait(); // the host still has to wait before reading

for (int i = 0; i < N; i++) {
  std::cout << data[i] << std::endl;
}
free(data, q);
depends_on avoids the extreme of serializing all tasks: it expresses a dependency only between the tasks that need one, as sketched below.
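A minimal sketch (N and the data shape assumed) of an explicit dependency with handler::depends_on:

queue q;
int* data = malloc_shared<int>(N, q);

// e1 initializes the data
auto e1 = q.parallel_for(N, [=](auto i) { data[i] = 10; });

// this task runs only after e1 completes; unrelated tasks remain unordered
q.submit([&](handler& h) {
  h.depends_on(e1);
  h.parallel_for(N, [=](auto i) { data[i] += 2; });
}).wait();

free(data, q);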
A sub-group is a collection of contiguous work-items in the global index space that execute in the same VE (vector engine) thread. When the device compiler compiles the kernel, multiple work-items are packed into a sub-group by vectorization so that the generated SIMD instruction stream can perform the work of multiple work-items simultaneously. Properly partitioning work-items into sub-groups can make a big performance difference.
Work-items in a sub-group can communicate with each other using shuffle operations, avoiding local or global memory accesses and so providing higher performance.
Work-items in a sub-group can use the sub-group group algorithms, which implement frequently used parallel patterns (both are sketched below).
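A minimal sketch (kernel shape, N, B and data are assumed) of a sub-group shuffle and a sub-group group algorithm:

h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
  auto sg = item.get_sub_group();
  int x = data[item.get_global_id(0)];
  // shuffle: read the next work-item's value straight from its registers
  int from_right = shift_group_left(sg, x, 1);
  // group algorithm: sum x across the whole sub-group
  int sg_sum = reduce_over_group(sg, x, plus<>());
  data[item.get_global_id(0)] = from_right + sg_sum;
});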
get_local_id() returns the index(es) of the work-item within its work-group
get_local_range() returns the size(s) of the work-group
get_global_id() returns the index(es) of the work-item within the global range
get_global_range() returns the global size(s)
get_group_id() returns the index(es) of the work-group
get_group_range() returns the number of work-groups in each dimension of the global range
These functions return id<n> (or range<n>) values, so either index the result (e.g. item.get_local_id()[0]) or pass the dimension to the function as a parameter to fetch a single component (e.g. item.get_local_id(0)).
Example:
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 8; // global size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";

  q.submit([&](handler& h) {
    // a stream must be created to print from the device; std::cout is not usable there
    auto cout = stream(1024, 768, h);
    h.parallel_for(nd_range<3>({N, N, N}, {2, 2, 2}), [=](nd_item<3> item) {
      auto local_id = item.get_local_id();
      auto local_range = item.get_local_range();
      auto global_id = item.get_global_id();
      auto global_range = item.get_global_range();
      auto group = item.get_group();
      auto group_range = item.get_group_range();
      // print only work-group 0; otherwise output from all groups interleaves arbitrarily
      if (item.get_group_linear_id() == 0) {
        cout << "Local 3D(" << local_id[0] << "," << local_id[1] << "," << local_id[2] << ") ";
        cout << "Local R(" << local_range[0] << "," << local_range[1] << "," << local_range[2] << ") ";
        cout << "Global 3D(" << global_id[0] << "," << global_id[1] << "," << global_id[2] << ") ";
        cout << "Global R(" << global_range[0] << "," << global_range[1] << "," << global_range[2] << ") ";
        cout << "Group 3D(" << group[0] << "," << group[1] << "," << group[2] << ") ";
        cout << "Group R(" << group_range[0] << "," << group_range[1] << "," << group_range[2] << ") ";
        cout << "Global index:" << global_id[0] + N * (global_id[1] + N * global_id[2]) << " ";
        cout << "LLid:" << item.get_local_linear_id() << ", GLid:" << item.get_global_linear_id()
             << ", GrpLid:" << item.get_group_linear_id() << "\n";
      }
    });
  }).wait();
}

Output:
Device : Intel(R) UHD Graphics P630 [0x3e96]
Local 3D(0,0,0) Local R(2,2,2) Global 3D(0,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:0 LLid:0, GLid:0, GrpLid:0
Local 3D(0,0,1) Local R(2,2,2) Global 3D(0,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:64 LLid:1, GLid:1, GrpLid:0
Local 3D(0,1,0) Local R(2,2,2) Global 3D(0,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:8 LLid:2, GLid:8, GrpLid:0
Local 3D(0,1,1) Local R(2,2,2) Global 3D(0,1,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:72 LLid:3, GLid:9, GrpLid:0
Local 3D(1,0,0) Local R(2,2,2) Global 3D(1,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:1 LLid:4, GLid:64, GrpLid:0
Local 3D(1,0,1) Local R(2,2,2) Global 3D(1,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:65 LLid:5, GLid:65, GrpLid:0
Local 3D(1,1,0) Local R(2,2,2) Global 3D(1,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:9 LLid:6, GLid:72, GrpLid:0
Sub-groups are not the same: the functions below all return id<1>, so the value is retrieved by indexing, e.g. ret[0]:
get_local_id() returns the id of the work-item within its sub-group.
get_local_range() returns the range of work-items in its sub-group.
get_group_id() returns the id of the sub-group within the work-group.
get_group_range() returns the range of sub-groups in the work-group.
The sub-group size is chosen implicitly by the compiler, but it can also be set to one of the sizes the hardware supports. Query the supported sizes with get_info:
auto sizes = q.get_device().get_info<info::device::sub_group_sizes>();
The user can set the sub-group size by adding [[intel::reqd_sub_group_size(8)]] to the kernel:
q.submit([&](handler& h) {
  //# set up a sycl stream class to print standard output from device code
  auto out = stream(1024, 768, h);
  //# nd-range kernel with a user-specified sub_group size
  h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) [[intel::reqd_sub_group_size(8)]] {
    //# get the sub_group handle
    auto sg = item.get_sub_group();
    //# query the sub_group and print sub_group info once per sub_group
    if (sg.get_local_id()[0] == 0) {
      out << "sub_group id: " << sg.get_group_id()[0] << " of " << sg.get_group_range()[0]
          << ", size=" << sg.get_local_range()[0] << "\n";
    }
  });
}).wait();
How to implement the get_sub_group_linear_id() function?
sub_group_linear_id = sub_group_id in the work-group + sub_group_range (sub-groups per work-group) × work_group_id
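A minimal sketch implementing this formula (the helper name comes from the question above; a 1-D nd_item is assumed):

size_t get_sub_group_linear_id(nd_item<1> item) {
  auto sg = item.get_sub_group();
  return sg.get_group_id()[0]                                  // sub_group_id in the work-group
       + sg.get_group_range()[0] * item.get_group_linear_id(); // sub_group_range * work_group_id
}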
Before SYCL 2020 introduced simplified reductions, a reduction over the global range was implemented in two steps: a reduction within each work-group, then an accumulation of the work-group results:
q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
  auto wg = item.get_group();
  auto i = item.get_global_id();
  // step 1: reduce within each work-group
  int sum_wg = reduce_over_group(wg, data[i], plus<>());
  if (item.get_local_id(0) == 0) { // first item of each work-group, so results land at stride B
    data[i] = sum_wg;
  }
}).wait();

q.single_task([=]() {
  // step 2: accumulate the per-work-group results
  int sum = 0;
  for (int i = 0; i < N; i += B) { // stride = B
    sum += data[i];
  }
  data[0] = sum;
}).wait();
With simplified reductions, a reduction object does the same thing in a single kernel:
queue q;

auto data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
  data[i] = i;
}
auto sum = malloc_shared<int>(1, q);
sum[0] = 0;

q.parallel_for(nd_range<1>(N, B), reduction(sum, plus<>()),
               [=](nd_item<1> item, auto& sum) {
  int i = item.get_global_id(0);
  sum += data[i];
}).wait();

std::cout << "Sum = " << sum[0] << std::endl;

free(data, q);
free(sum, q);
Multiple Reductions in one kernel:
{
  //# create buffers
  buffer buf_data(data);
  buffer buf_sum(&sum, range(1));
  buffer buf_min(&min, range(1));
  buffer buf_max(&max, range(1));

  q.submit([&](handler& h) {
    //# create an accessor for the input data
    accessor acc_data(buf_data, h, read_only);

    //# define reduction objects for sum, min, max reductions
    auto reduction_sum = reduction(buf_sum, h, plus<>());
    auto reduction_min = reduction(buf_min, h, minimum<>());
    auto reduction_max = reduction(buf_max, h, maximum<>());

    //# parallel_for with multiple reduction objects
    h.parallel_for(nd_range<1>{N, B}, reduction_sum, reduction_min, reduction_max,
                   [=](nd_item<1> it, auto& temp_sum, auto& temp_min, auto& temp_max) {
      auto i = it.get_global_id();
      temp_sum.combine(acc_data[i]);
      temp_min.combine(acc_data[i]);
      temp_max.combine(acc_data[i]);
    });
  });
}
L4 Profiler
VTune Profiler
Intel® VTune™ Profiler Performance Analysis Cookbook: Link
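A sketch of a typical VTune run from the command line (analysis type and paths are assumptions; the cookbook covers many more workflows):

# collect a GPU hotspots profile of the application
vtune -collect gpu-hotspots -result-dir ./vtune_gpu -- ./matrix.dpcpp
# print a summary of the collected result
vtune -report summary -result-dir ./vtune_gpu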
#!/bin/bash
# To run the GPU Roofline analysis in the Intel® Advisor CLI:

# 1. Run the Survey analysis with the --enable-gpu-profiling option:
advixe-cl --collect=survey --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 2. Run the Trip Counts and FLOP analysis with the --enable-gpu-profiling option:
advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 3. Generate a GPU Roofline report:
advixe-cl --report=roofline --gpu --project-dir=./adv

# 4. Open the generated roofline.html in a web browser to visualize GPU performance.
Vectorization
Offload Modeling
The easiest way to run Offload Advisor is the batch mode, which consists of running two scripts available in the folder $APM ($APM is defined once Advisor is sourced).
collect.py: Collects data such as timings, FLOPs, trip counts, and more
analyze.py: Creates the report
To be more specific, collect.py runs the following analyses:
survey: Times your application's functions and loops and reads compiler diagnostics
tripcount: Counts loop iterations, with FLOP and cache simulation to also count operations and memory transfers
dependency: Checks for data dependencies in your loops, which can prevent them from being good candidates for offloading or vectorization
Offload Advisor is currently run from the command line, as sketched below. Once the run is complete you can view the generated report.html.
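A sketch of a typical batch run (project directory, target config, and application path are assumptions; adjust them to your setup):

# collect performance data for the application
advixe-python $APM/collect.py ./adv_project --config gen9 -- ./matrix.dpcpp
# model the offload and generate report.html
advixe-python $APM/analyze.py ./adv_project --config gen9 --out-dir ./report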
An Execution Unit (EU) is the smallest thread-level building block of the Intel® Iris® Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of an 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Unit (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF), each 32 B wide.
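Working out the register budget from those numbers: each thread's register file is 128 × 32 B = 4 KB, so the seven threads of one EU hold 7 × 4 KB = 28 KB of registers in total.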