Using the buffer API: the runtime tracks the data and the task dependencies through buffers and accessors.
using mkl::blas::gemm;
int64_t n = 32;
device dev(gpu_selector{}); // or host_selector{} / cpu_selector{}
queue Q(dev);
size_t bytes = n*n*sizeof(double);
double* A = ...;
double* B = ...;
double* C = ...;
buffer<double, 1> A_buf{A, range<1>(n*n)};
buffer<double, 1> B_buf{B, range<1>(n*n)};
buffer<double, 1> C_buf{C, range<1>(n*n)};
gemm(Q, mkl::transpose::N, mkl::transpose::N,
n, n, n, 1.0, A_buf, n, B_buf, n, 0.0, C_buf, n);
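With the buffer API no explicit wait is needed here: the result becomes visible on the host through a host accessor (or when C_buf goes out of scope). A minimal sketch, assuming the buffers above:
// blocks until the gemm that writes C_buf has completed
host_accessor C_host(C_buf, read_only);
std::cout << "C[0] = " << C_host[0] << std::endl;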
Using the USM API: the user defines how the data is shared and is responsible for making sure tasks have completed before the data is reused.
using mkl::blas::gemm;
int64_t n = 32;
device dev(gpu_selector{}); // or host_selector{} / cpu_selector{}
queue Q(dev);
size_t bytes = n*n*sizeof(double);
double* A = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
double* B = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
double* C = static_cast<double*>(malloc_shared(bytes, dev, Q.get_context()));
gemm(Q, mkl::transpose::N, mkl::transpose::N,
n, n, n, 1.0, A, n, B, n, 0.0, C, n);
Q.wait_and_throw(); // wait until the kernel has finished writing the data
A USM allocation is bound to the context of the queue it was created with; it is not accessible from a queue that uses a different context (see the sketch after the allocation calls below).
int *data = malloc_shared<int>(N, q);
// or
int *data = static_cast<int*>(malloc_shared(N*sizeof(int), q));
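A minimal sketch of that binding, assuming a device dev and using namespace sycl: two queues built on the same context can both use the allocation.
context ctx(dev);
queue q1(ctx, dev), q2(ctx, dev);
int *p = malloc_shared<int>(N, dev, ctx); // bound to ctx, not to a single queue
q1.parallel_for(range<1>(N), [=](id<1> i) { p[i] = 1; }).wait();
q2.parallel_for(range<1>(N), [=](id<1> i) { p[i] += 1; }).wait(); // OK: same context
free(p, ctx);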
Usage
Before USM, SYCL provided the buffer/accessor method:
queue q;
// allocating data on host
int *data = static_cast<int*>(malloc(N*sizeof(int)));
// initialize on host
for (int i = 0; i < N; i++) {
data[i] = 10;
}
// create a scope for auto destroy
{
buffer<int, 1> buf(data, range<1>(N));
q.submit([&](handler& h) {
// need the help of accessor to access data on device
auto A = buf.get_access<access::mode::read_write>(h);
h.parallel_for(range<1>(N), [=](id<1> i) {
A[i] += 1;
});
});
} // auto buffer destruction, updating data on host from device
// accessible on host again after auto buffer destruction
for (int i = 0; i < N; i++) {
std::cout << data[i] << std::endl;
}
// destroy data on USM
free(data);
With USM, the developer no longer needs to manage the data movement explicitly:
queue q;
// allocating data on USM
auto data = malloc_shared<int>(N, q); // specify queue
// initialize on host
for (int i = 0; i < N; i++) {
data[i] = 10;
}
// accessible on device
q.parallel_for(N, [=](auto i) {
data[i] += 1;
}).wait(); // need an explicit wait before the host accesses the data again
// accessible on host again after wait finished
for (int i = 0; i < N; i++) {
std::cout << data[i] << std::endl;
}
// destroy data on USM
free(data, q); // specify queue
Allocation Types
sycl::malloc_device: allocates memory on the device; it is not accessible from the host and needs explicit copies (memcpy) to move data to or from the host.
sycl::malloc_host: allocates memory on the host; the device can access it directly via Direct Memory Access, without migration (see the malloc_host sketch after the example below).
sycl::malloc_shared: allocates memory that migrates between host and device automatically and implicitly. (A wait is still needed to ensure only one side accesses it at a time.)
Example:
queue q;
int data[N];
for (int i = 0; i < N; i++) {
data[i] = 10;
}
// allocate on device
int *data_device = malloc_device<int>(N, q);
// memory copy from host to device
q.memcpy(data_device, data, sizeof(int)*N).wait();
// access data on device
q.parallel_for(N, [=](auto i) {
data_device[i] += 1;
}).wait();
// memory copy back from device to host
q.memcpy(data, data_device, sizeof(int)*N).wait();
for (int i = 0; i < N; i++) {
std::cout << data[i] << std::endl;
}
free(data_device, q); // release the device allocation
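For comparison, a minimal malloc_host sketch (same N and q assumed): the kernel reads and writes host memory directly, so no memcpy is needed.
int *data_host = malloc_host<int>(N, q);
for (int i = 0; i < N; i++) {
data_host[i] = 10;
}
// the device accesses the host allocation directly, no migration or copy
q.parallel_for(range<1>(N), [=](id<1> i) {
data_host[i] += 1;
}).wait();
for (int i = 0; i < N; i++) {
std::cout << data_host[i] << std::endl;
}
free(data_host, q);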
Data Dependency
The in_order queue property serializes all submitted tasks, which avoids data races (at the cost of losing overlap between independent kernels).
Example:
queue q{property::queue::in_order()};
int *data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
data[i] = 10;
}
q.parallel_for(N, [=](auto i) {
data[i] += 2;
});
q.parallel_for(N, [=](auto i) {
data[i] += 3;
});
q.parallel_for(N, [=](auto i) {
data[i] += 4;
});
q.wait(); // make sure all three kernels have finished before the host reads the data
for (int i = 0; i < N; i++) {
std::cout << data[i] << std::endl;
}
free(data, q);
depends_on expresses dependencies between specific tasks, avoiding the extreme of serializing everything.
Example:
queue q;
int *data1 = malloc_shared<int>(N, q);
int *data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
data1[i] = 10;
data2[i] = 10;
}
auto e1 = q.parallel_for(N, [=](auto i) {
data1[i] += 2;
});
auto e2 = q.parallel_for(N, [=](auto i) {
data2[i] += 3;
});
q.submit([&](handler& h) {
h.depends_on({e1, e2});
h.parallel_for(N, [=](auto i) {
data1[i] += data2[i];
});
}).wait();
for (int i = 0; i < N; i++) {
std::cout << data1[i] << std::endl;
}
free(data1, q);
free(data2, q);
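Events can also be passed directly to the queue shortcuts. A hedged variant of the earlier malloc_device example (same q, N, data and data_device names assumed), chaining the copies and the kernel through events instead of blocking after every step:
auto e_in = q.memcpy(data_device, data, sizeof(int)*N);
auto e_k = q.parallel_for(range<1>(N), e_in, [=](id<1> i) {
data_device[i] += 1;
});
q.memcpy(data, data_device, sizeof(int)*N, e_k).wait(); // only one blocking wait at the end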
Sub Groups
A sub-group is a collection of contiguous work-items in the global index space that execute in the same VE thread. When the device compiler compiles the kernel, multiple work-items are packed into a sub-group by vectorization so the generated SIMD instruction stream can perform tasks of multiple work-items simultaneously. Properly partitioning work-items into sub-groups can make a big performance difference.
Work-items in a sub-group can communicate with each other using shuffle operations, without going through local or global memory, which gives higher performance (a shuffle sketch follows the group-algorithm example below).
Work-items in a sub-group can also use the sub-group group algorithms, which implement frequently used parallel patterns:
sycl::shift_group_left(sg, x, 1)
sycl::shift_group_right(sg, x, 1)
sycl::select_from_group(sg, x, id)
sycl::permute_group_by_xor(sg, x, mask)
sycl::group_broadcast(sg, x, id)
sycl::reduce_over_group(sg, x, op)
sycl::exclusive_scan_over_group(sg, x, op)
sycl::inclusive_scan_over_group(sg, x, op)
sycl::any_of_group(sg, x)
sycl::all_of_group(sg, x)
sycl::none_of_group(sg, x)
Example of sub-group group algorithm:
h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
auto sg = item.get_sub_group();
auto i = item.get_global_id(0);
/*Group Algorithm*/
data[i] = reduce_over_group(sg, data[i], plus<>());
});
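And a minimal sketch of a shuffle, assuming the same data, N and B: each work-item takes the value of its neighbour one lane to the right within the sub-group.
h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
auto sg = item.get_sub_group();
auto i = item.get_global_id(0);
// value held by the work-item at local id + 1 in the same sub-group
data[i] = shift_group_left(sg, data[i], 1);
});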
Some query functions of nd_item (work-item attributes):
get_local_id() returns the index(es) of the work-item within its work-group
get_local_range() returns the size(s) of the work-group
get_global_id() returns the index(es) of the work-item within the global range
get_global_range() returns the global range size(s)
get_group_id() returns the index(es) of the work-group
get_group_range() returns the number of work-groups in each dimension of the global range
These functions return an id<n>, so you either index the result (e.g. item.get_local_id()[0]) or pass the dimension as a parameter to fetch a single component (e.g. item.get_local_id(0)).
Example:
#include <CL/sycl.hpp>
using namespace sycl;
static constexpr size_t N = 8; // global size
int main() {
queue q;
std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
q.submit([&](handler &h) {
// std::cout cannot be used in device code; create a sycl::stream to print from the device
auto cout = stream(1024, 768, h);
h.parallel_for(nd_range<3>({N, N, N}, {2, 2, 2}), [=](nd_item<3> item) {
auto local_id = item.get_local_id();
auto local_range = item.get_local_range();
auto global_id = item.get_global_id();
auto global_range = item.get_global_range();
auto group = item.get_group();
auto group_range = item.get_group_range();
// only print the work-items of the first work-group, otherwise the output from all work-items interleaves
if (item.get_group_linear_id() == 0) {
cout << "Local 3D(" << local_id[0] << "," << local_id[1] << "," << local_id[2] << ") ";
cout << "Local R(" << local_range[0] << "," << local_range[1] << "," << local_range[2] << ") ";
cout << "Global 3D(" << global_id[0] << "," << global_id[1] << "," << global_id[2] << ") ";
cout << "Global R(" << global_range[0] << "," << global_range[1] << "," << global_range[2] << ") ";
cout << "Group 3D(" << group[0] << "," << group[1] << "," << group[2] << ") ";
cout << "Group R(" << group_range[0] << "," << group_range[1] << "," << group_range[2] << ") ";
cout << "Global index:" << global_id[0]+N*(global_id[1]+N*global_id[2]) << " ";
cout << "LLid:" << item.get_local_linear_id() << ", GLid:" << item.get_global_linear_id() << ", GrpLid:" << item.get_group_linear_id() << "\n";
}
});
});
}
Output:
Device : Intel(R) UHD Graphics P630 [0x3e96]
Local 3D(0,0,0) Local R(2,2,2) Global 3D(0,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:0 LLid:0, GLid:0, GrpLid:0
Local 3D(0,0,1) Local R(2,2,2) Global 3D(0,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:64 LLid:1, GLid:1, GrpLid:0
Local 3D(0,1,0) Local R(2,2,2) Global 3D(0,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:8 LLid:2, GLid:8, GrpLid:0
Local 3D(0,1,1) Local R(2,2,2) Global 3D(0,1,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:72 LLid:3, GLid:9, GrpLid:0
Local 3D(1,0,0) Local R(2,2,2) Global 3D(1,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:1 LLid:4, GLid:64, GrpLid:0
Local 3D(1,0,1) Local R(2,2,2) Global 3D(1,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:65 LLid:5, GLid:65, GrpLid:0
Local 3D(1,1,0) Local R(2,2,2) Global 3D(1,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:9 LLid:6, GLid:72, GrpLid:0
The sub-group queries are different: the functions below all return an id<1>, so the value is retrieved by indexing (e.g. ret[0]):
get_local_id() returns the id of the work-item within its sub-group.
get_local_range() returns the number of work-items in the sub-group.
get_group_id() returns the id of the sub-group within its work-group.
get_group_range() returns the number of sub-groups in the work-group.
The sub-group size is chosen by the compiler implicitly, but it can also be set explicitly to any size supported by the hardware. Query the supported sizes with get_info:
auto sizes = q.get_device().get_info<info::device::sub_group_sizes>();
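For example, a small sketch that prints the supported sizes (get_info returns a std::vector<size_t>):
auto sizes = q.get_device().get_info<info::device::sub_group_sizes>();
std::cout << "Supported sub-group sizes:";
for (auto s : sizes) std::cout << " " << s;
std::cout << std::endl;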
The user can request a specific sub-group size by adding the [[intel::reqd_sub_group_size(8)]] attribute to the kernel:
q.submit([&](handler &h) {
//# setup sycl stream class to print standard output from device code
auto out = stream(1024, 768, h);
//# nd-range kernel with user specified sub_group size
h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(8)]] {
//# get sub_group handle
auto sg = item.get_sub_group();
//# query sub_group and print sub_group info once per sub_group
if (sg.get_local_id()[0] == 0) {
out << "sub_group id: " << sg.get_group_id()[0] << " of "
<< sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
<< "\n";
}
});
}).wait();
How can we compute a linear id for a sub-group across the whole nd-range (a get_sub_group_linear_id())?
sub_group_linear_id = sub_group_id_in_work_group + sub_groups_per_work_group * work_group_id
q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
auto sg = item.get_sub_group();
auto sg_linear_id = sg.get_group_id()[0] + sg.get_group_range()[0] * item.get_group(0);
}).wait();
Reduction
Before the simplified reduction introduced in SYCL 2020, a reduction over the global range was implemented in two steps: a reduction within each work-group, followed by an accumulation of the per-work-group results:
q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
auto wg = item.get_group();
auto i = item.get_global_id();
int sum_wg = reduce_over_group(wg, data[i], plus<>());
if (item.get_local_id(0) == 0) { // item local id in work group, stride = B
data[i] = sum_wg;
}
}).wait(); // the per-work-group partial sums must be written before the final accumulation
q.single_task([=]() {
int sum = 0;
for (int i = 0; i < N; i+=B) { // stride = B
sum += data[i];
}
data[0] = sum;
});
With the simplified reduction, we can use a reduction object to do the same thing:
queue q;
auto data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
data[i] = i;
}
auto sum = malloc_shared<int>(1, q);
sum[0] = 0;
q.parallel_for(nd_range<1>(N, B),
reduction(sum, plus<>()),
[=](nd_item<1> item, auto& sum) {
int i = item.get_global_id(0);
sum += data[i];
}).wait();
std::cout << "Sum = " << sum[0] << std::endl;
free(data, q);
free(sum, q);
Multiple Reductions in one kernel:
{
//# create buffers
buffer buf_data(data);
buffer buf_sum(&sum, range(1));
buffer buf_min(&min, range(1));
buffer buf_max(&max, range(1));
q.submit([&](handler& h) {
//# create accessors for data and results
accessor acc_data(buf_data, h, read_only);
//# define reduction objects for sum, min, max reduction
auto reduction_sum = reduction(buf_sum, h, plus<>());
auto reduction_min = reduction(buf_min, h, minimum<>());
auto reduction_max = reduction(buf_max, h, maximum<>());
//# parallel_for with multiple reduction objects
h.parallel_for(nd_range<1>{N, B}, reduction_sum, reduction_min, reduction_max, [=](nd_item<1> it, auto& temp_sum, auto& temp_min, auto& temp_max) {
auto i = it.get_global_id();
temp_sum.combine(acc_data[i]);
temp_min.combine(acc_data[i]);
temp_max.combine(acc_data[i]);
});
});
}
L4 Profiler
VTune Profiler
Intel® VTune™ Profiler Performance Analysis Cookbook: Link
#!/bin/bash
# To run the GPU Roofline analysis in the Intel® Advisor CLI:
# 1. Run the Survey analysis with the --enable-gpu-profiling option:
advixe-cl --collect=survey --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp
# 2. Run the Trip Counts and FLOP analysis with the --enable-gpu-profiling option:
advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp
# 3. Generate a GPU Roofline report:
advixe-cl --report=roofline --gpu --project-dir=./adv
# 4. Open the generated roofline.html in a web browser to visualize GPU performance.
Vectorization
Offload Modeling
The easiest way to run Offload Advisor is the batch mode, which consists of running two scripts available in the folder $APM ($APM is defined when Advisor is sourced):
collect.py: collects data such as timing, flops, trip counts and more
analyze.py: creates the report
To be more specific, collect.py runs the following analyses:
survey: times your application's functions and loops and reads compiler diagnostics
tripcount: counts loop iterations, with flops and cache simulation, to estimate the number of operations and memory transfers
dependency: checks whether your loops carry data dependencies that would prevent them from being good candidates for offloading or vectorization
Offload Advisor is currently run from the command-line as below. Once the run is complete you can view the generated report.html.
An Execution Unit (EU) is the smallest thread-level building block of the Intel® Iris® Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of an 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Unit (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF), each 32 bytes wide.
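As a loose tie-back to the SYCL code above, a small sketch querying some of these parameters through device info (on Intel GPUs max_compute_units typically corresponds to the EU count; this mapping is an assumption, not guaranteed by the spec):
queue q(gpu_selector{});
auto dev = q.get_device();
std::cout << dev.get_info<info::device::name>() << "\n";
std::cout << "Compute units (EUs): " << dev.get_info<info::device::max_compute_units>() << "\n";
std::cout << "Sub-group (SIMD) sizes:";
for (auto s : dev.get_info<info::device::sub_group_sizes>()) std::cout << " " << s;
std::cout << "\n";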