Lectures

L1 Intro

硬件平台

计算节点：
- 计算节点：高性能刀片或者机架服务器
- 胖服务器节点：SMP架构高性能服务器，具备多CPU和大内存容量
- GPU计算节点：服务器中安装GPGPU卡，使用GPU进行运算加速
三平面组网：
- 数据网络：用于存储或者数据传输
- 计算网络：用于计算过程中的消息传递
- 管理网络：用于集群系统管理

HPC典型业务负载和应用

Intel 特性

AI量化
Sapphire Rapids: 提供 AMX 矩阵运算指令集，提供超AVX512 16倍的性能
GPU: Ponte Vecchio
- 128X Cores
- 64MB L1 cache + 408MB L2 Cache + 128GB HBM2e
Optane 持久内存
DAOS 文件存储格式
- 极高IO带宽和IOPS
- 克服给予块和POSIX接口的性能瓶颈，支持HPC/AI/BIG DATA多种业务接口
oneAPI 统一的软件堆栈
- Low-Level Libraries:
  - oneMKL: Math Kernel Library
  - oneDNN: Deep Neural Network
  - oneDTL: Data Transform Library
  - oneIPL: Image Process Library
  - oneVPL: Video Process Library
  - oneTBB: Threading Building Blocks
  - oneCCL: Collective Communication Library
  - oneDPL: DPC++ Library
  - etc
- Languages:
  - DPC++
  - SYCL
  - etc
- DPCT: Intel DPC++ Compatibility Tool
  - 协助开发者完成一次性代码迁移，从CUDA到DPC++
- CFlags:
  - Profile guided optimization(multi-step build):
    -prof-gen: generation
    -prof-use: usage
  - -fast: -ipo -O3 -no-prec-div -static -fp-mode fast
  - -fiopenmp -fopenmp-targets=spir64: offloading openmp on gpu device
  - -O2 vs -O3: in some scenarios, O2 performs better than O3
  - -mprefer-vector-width=512: LLVM cflag, indicating the preference to AVX512
  - -ipo: multi-file inter-procedural optimization, aiming at the potential optimization in caller/callee chain
  - -pad: padding the array to rearrange the memory distribution to make full use of cache
- Compilers:

Driver

Target

OpenMP support

OpenMP offload Support

icc

CPU

dpcpp

CPU,GPU,FPGA

icx

CPU,GPU

ifort

CPU

ifx

CPU,GPU

L2.1 DPC++

Intel Advisor

Offload Advisor: 预估卸载至加速器的性能
Roofline Analysis: 优化用于内存和计算的 CPU/GPU 代码
Vectorization Advisor: 添加和优化矢量化
Threading Advisor: 为非线程化应用添加有效的线程化功能
流图分析器：高校创建和分析流图

DPC++

#include <CL/sycl.cpp>
constexpr int N = 16;
using namespace sycl;

int main() {
    // On host machine
    queue q;
    int *data = malloc_shared<int>(N, q); // Allocating on USM(unified shared memory)
    
    // On accelerator
    q.parallel_for(N, [=](auto i) {
        data[i] = i;
    }).wait();

    // On host machine
    for (int i = 0; i < N; i++) {
        std::cout << data[i] << std::endl;
    }
    free(data, q);
}

Device Selector

device_selector supports selecting specific devices in runtime, executing the kernel with information provided by user heuristically. Example:

default_selector d_selector;
host_selector h_selector;
cpu_selector c_selector;
gpu_selector g_selector;
queue q(d_selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

queue

Queue is the execution of submitting commands to SYCL runtime.
Queue is the mechanism of submitting tasks to devices.
Queue:Device is M:1 relationship.

queue q;
q.submit([&](handler& h) {
    // COMMAND GROUP CODE
});

kernel

Kernel encapsulates the data and function when executing in device.
Kernel is not constructed explicitly by user.
Kernel is constructed when calling kernel scheduling functions like parallel_for.

The lambda function submitted in queue is called command group.

q.submit([&](handler& h) {
    h.parallel_for(range<1>(N), [=](id<1> i) {
        A[i] = B[i] + C[i];
    });
});

Basic Parallel Kernel

range: describe the parallel execution iteration space
id: index a single instance in parallel execution
item: a single instance, can be used to query the range and other attributes

// CPU Code
for (int i = 0; i < 1024; i++) {
    // CODE
}

// Parallel Code
h.parallel_for(range<1>(1024), [=](id<1> idx) {
    // CODE
});

h.parallel_for(range<1>(1024), [=](item<1> item) {
    auto idx = item.get_id();
    auto R = item.get_range();
    // CODE
});

ND-Range Kernel

ND-Range 内核是另一种表示并行性的方法，通过提供对本地内存的访问以及将执行映射到硬件上的计算单元实现底层性能调整。
nd_range 将整个迭代空间划分为多个较小的工作组，工作组中的工作被安排在硬件的单个计算单元上。
讲内核执行分组到工作组中，可以控制工作分配中的资源使用和负载均衡。
nd_range 的第一个参数表示整个迭代空间的大小(global_size)，第二个参数表示工作组的大小(work-group size)。

h.parallel_for(nd_range<1>(range<1>(1024) /*global size*/,
                           range<1>(64))  /*work-group size*/,
                           [=](nd_item<1> item) { /*actual work task*/
    auto idx = item.get_global_id;
    auto local_id = item.get_local_id();
    // CODE
});

缓冲区内存模式

缓冲区表示数据的所有权，不表示数据具体的存储位置
缓冲区：将数据封装在 SYCL 应用中
访问器：用于实际访问缓冲区中的数据
- 在 SYCL 图中创建数据依赖关系，以在内核执行中进行排序

对于下面的例子而言，accessor 与 handler 做绑定，handler 与 queue 绑定，决定所在的设备，于是 accessor 提供到对应设备的访问能力。

queue q;
std::vector<int> v(N, 10);
{
    buffer buf(v);
    q.submit([&](handler& h) {
        accessor a(buf, h, write_only);
        h.parallel_for(N, [=](auto i) { a[i] = i; });
    });
}
for (int i = 0; i < N; i++) {
    std::cout << v[i] << " ";
}

异步执行

SYCL 应用分为两个部分：

主机代码
内核执行图

除了同步操作，这些执行都是互相独立的

主机代码提交工作(q.submit)以构建graph图（同时自身可以执行计算操作）
内核执行和数据移动的graph图由SYCL运行时管理与主机代码异步zhixing

若存在多个内核，SYCL会根据accessor自动解决数据依赖：

int main() {
    auto R = range<1>(num);
    buffer<int> A{R}, B{R};
    queue q;

    // Kernel 1
    q.submit([&](auto& h) {
        accessor out(A, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 2
    q.submit([&](auto& h) {
        accessor out(A, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 3
    q.submit([&](auto& h) {
        accessor out(B, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 4
    q.submit([&](auto& h) {
        accessor in(A, h, read_only);
        accessor inout(B, h);
        h.parallel_for(R, [=](id<1> i) {
            inout[i] *= in[i];
        });
    });

}

数据依赖图可以分析得到：

graph TD
Start -->|A| 1 -->|A| 2 -->|A| 4 --> End
Start -->|B| 3 -->|B| 4

同步操作

主机访存器

主机访问设备上的数据通过创建 host_accessor 实现。因为数据可以还在计算过程中，所以主机访存器的创建是一个阻塞的同步操作，只有等到内核均完成执行并且数据可以通过主机访存器提供给主机后才返回。

#include <CL/sycl.hpp>
using namespace sycl;
constexpr int N = 16;

int main() {
    std::vector<double> v(N, 100);
    queue q;

    buffer buf(v); // 缓冲区掌握着存储在矢量中的数据
    q.submit([&](auto& h) {
        accessor a(buf, h);
        h.parallel_for(N, [=](auto i) {
            a[i] -= 2;
        });
    });

    host_accessor b(buf, read_only); // 阻塞请求
    for (int i = 0; i < N; i++) {
        std::cout << b[i] << std::endl;
    }
    return 0;
}

缓冲区销毁

#include <CL/sycl.hpp>
using namespace sycl;
constexpr int N = 16;

void dpcpp_code(std::vector<double>& v, queue& q) {
    buffer buf(v); // 当缓冲区的创建在单独的函数范围内进行
    q.submit([&](auto& h) {
        accessor a(buf, h);
        h.parallel_for(N, [=](auto i) {
            a[i] -= 2;
        });
    });
}

int main() {
    std::vector<double> v(N, 10);
    queue q;
    dpcpp_code(v, q); 
    // 完成该函数的调用后会调用缓冲区的销毁函数(析构函数)
    // 缓冲区将放弃数据的所有权，将数据拷贝回主机内存
    // 因此主机可以直接访问 v 上计算完成的数据
    for (int i = 0; i < N; i++) {
        std::cout << v[i] << std::endl;
    }
    return 0;
}

Example

//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"
#include "Complex.hpp"

using namespace sycl;
using namespace std;

// Number of complex numbers passing to the SYCL code
static const int num_elements = 10000;

class CustomDeviceSelector : public device_selector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) const override {
    int device_rating = 0;
    //We are querying for the custom device specific to a Vendor and if it is a GPU device we
    //are giving the highest rating as 3 . The second preference is given to any GPU device and the third preference is given to
    //CPU device. 
    if (dev.is_gpu() & (dev.get_info<info::device::name>().find(vendorName_) !=
                        std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  };

 private:
  std::string vendorName_;
};

// in_vect1 and in_vect2 are the vectors with num_elements complex nubers and
// are inputs to the parallel function
void DpcppParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){ 
    std::cout << "ERROR: Vector sizes do not  match"<< "\n";
    return;
  }
  // Setup input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);

  // Setup Output buffers 
  buffer bufout_vect(out_vect);

  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit Command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set as read mode
    accessor V1(bufin_vect1,h,read_only);
    accessor V2(bufin_vect2,h,read_only);
    // Accessor set to Write mode
    accessor V3 (bufout_vect,h,write_only);
    h.parallel_for(R, [=](auto i) {
      V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}
void DpcppScalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){
    std::cout<<"ERROR: Vector sizes do not match"<<"\n";
    return;
    }
  for (int i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}
// Compare the results of the two output vectors from parallel and scalar. They
// should be equal
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if(v1.size() != v2.size()){
    ret_code = -1;
  }
  for (int i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}
int main() {
  // Declare your Input and Output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;

  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }

  // Initialize your Input and Output Vectors. Inputs are initialized as below.
  // Outputs are initialized with 0
  try {
    // Pass in the name of the vendor for which the device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    // queue constructor passed exception handler
    CustomDeviceSelector selector(vendor_name);
    queue q(selector, dpc_common::exception_handler);
    // Call the DpcppParallel with the required inputs and outputs
    DpcppParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // some other exception detected
    std::cout << "Failure" << "\n";
    std::terminate();
  }

  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the Parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  for (int i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the DpcppScalar function with the required input and outputs
  DpcppScalar(input_vect1, input_vect2, out_vect_scalar);

  // Compare the outputs from the parallel and the scalar functions. They should
  // be equal

  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else
    std::cout
        << "*********************************************Verification Failed. Results are "
           "not matched**************************"
        << "\n";

  return 0;
}

L2.2 oneMKL

运行库链接方式：查询onemkl link line advisor

Contacts: [email protected]

USM (Unified Shared Memory)

类似于 CUDA 中的 Unified Memory。

使用 Buffer API：buf 规定了数据的所有权，用户无需关心函数执行结果

using mkl::blas::gemm;
int64_t n = 32;

device dev({host, cpu, gpu}_selector());
queue Q(dev);

size_t bytes = n*n*sizeof(double);
double* A = ...;
double* B = ...;
double* C = ...;

buffer<double, 1> A_buf{A, range<1>(n*n)};
buffer<double, 1> B_buf{B, range<1>(n*n)};
buffer<double, 1> C_buf{C, range<1>(n*n)};

gemm(Q, mkl::transpose::N, mkl::transpose::N,
     n, n, n, 1.0, A_buf, n, B_buf, n, 0.0, C_buf, n);

使用 USM API：用户定义了数据的共享存在方式，需要对任务完成与否负责

using mkl::blas::gemm;
int64_t n = 32;

device dev({host, cpu, gpu}_selector());
queue Q(dev);

size_t bytes = n*n*sizeof(double);
double* A = malloc_shared(bytes, dev, Q.get_context());
double* B = malloc_shared(bytes, dev, Q.get_context());
double* C = malloc_shared(bytes, dev, Q.get_context());

gemm(Q, mkl::transpose::N, mkl::transpose::N,
     n, n, n, 1.0, A, n, B, n, 0.0, C, n);

Q.wait_and_throw(); // 等待任务对数据的写入结束

L3.1 Compiler Guidance

Frequently used commands:
- sycl-ls: listing available devices
- export SYCL_DEVICE_FILTER={CPU|GPU|FPGA}
Helpful links

L3.2 SYCL Features

USM

Setup

USM is bind to specific queue, not accessible to other queue.

int *data = malloc_shared<int>(N, q);
// or
int *data = static_cast<int*>(malloc_shared(N*sizeof(int), q));

Usage

Before using USM, the SYCL provides the buffer method:

queue q;

// allocating data on host
int *data = static_cast<int*>(malloc(N*sizeof(int), q));

// initialize on host
for (int i = 0; i < n; i++) {
    data[i] = 10;
}

// create a scope for auto destroy
{
    buffer<int, 1> buf(data, range<1>(N));
    q.submit([&](handler& h) {
        // need the help of accessor to access data on device
        auto A = buf.get_access<access::mode::read_write>(h);
        h.parallel_for(range<1>(N), [=](id<1> i) {
            A[i] += 1;
        });
    });
} // auto buffer destruction, updating data on host from device

// accessible on host again after auto buffer destruction
for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}

// destroy data on USM
free(data);

After using USM, developer does not need to care the data flow:

queue q;

// allocating data on USM
auto data = malloc_shared<int>(N, q); // specify queue

// initialize on host
for (int i = 0; i < n; i++) {
    data[i] = 10;
}

// accessible on device
q.parallel_for(N, [=](auto i) {
    data[i] += 1;
}).wait; // Need to add an explicit wait for the data transfer

// accessible on host again after wait finished
for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}

// destroy data on USM
free(data, q); // specify queue

Ways

sycl::malloc_device: allocate memory on device, no accessible to host, need explicit data copy to access on host.
sycl::malloc_host: allocate memory on host, accessible to device via Direct Memory Access.
sycl::malloc_shared: allocate memory which can migrate between device and host by program automatically and implicitly. (Need wait to ensure the only access from host or device)

Example:

queue q;

int data[N];
for (int i = 0; i < N; i++) {
    data[i] == 10;
}

// allocate on device
int *data_device = malloc_device<int>(N, q);

// memory copy from host to device
q.memcpy(data_device, data, sizeof(int)*N).wait()

// access data on device
q.parallel_for(N, [=](auto i) {
    data_device[i] += 1;
}).wait();

// memory copy back from device to host
q.memcpy(data, data_device, sizeof(int)*N).wait();

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}

data dependency

in_order feature will make all task serialized, avoiding the data race.

Example:

queue q{property::queue::in_order()};

int *data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data[i] = 10;
}

q.parallel_for(N, [=](auto i) {
    data[i] += 2;
});
q.parallel_for(N, [=](auto i) {
    data[i] += 3;
});
q.parallel_for(N, [=](auto i) {
    data[i] += 4;
});

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}
free(data, q);

depends_on can avoid making the situation the extreme by serializing all tasks.

Example:

queue q;

int *data1 = malloc_shared<int>(N, q);
int *data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data1[i] = 10;
    data2[i] = 10;
}

auto e1 = q.parallel_for(N, [=](auto i) {
    data1[i] += 2;
});

auto e2 = q.parallel_for(N, [=](auto i) {
    data2[i] += 3;
});
q.submit([&](handler& h) {
    h.depends_on({e1, e2});
    h.parallel_for(N, [=](auto i) {
        data1[i] += data2[i];
    });
}).wait();

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}
free(data1, q);
free(data2, q);

Sub Groups

See Reference. hierarchy

A sub-group is a collection of contiguous work-items in the global index space that execute in the same VE thread. When the device compiler compiles the kernel, multiple work-items are packed into a sub-group by vectorization so the generated SIMD instruction stream can perform tasks of multiple work-items simultaneously. Properly partitioning work-items into sub-groups can make a big performance difference.

Sub-group 是同时执行的或具有额外调度保证的工作项 work-item 的子集。
利用 Sub-group 有助于将执行映射到低级硬件，并可以帮助实现更高的性能。

工作组 work-group 中的所有工作会被安排到一个子切片 SubSlice 中，该子切片拥有自己的本地内存 SubSlice Local Memory，比访问 L3 Cache 快很多。
Sub-group 中的所有工作项都在单个 Execution Units 线程上执行。
Sub-group 中的每个工作项都映射到 SIMD 通道。

Why Sub-group?

work-item in Sub-group can communicate with each other using shuffle operations, avoiding accessing local or global memory, providing higher performance.
work-item in Sub-group can access sub-group group algorithm which implements frequently used parallel strategies.
- sycl::shift_group_left(sg, x, 1)
- sycl::shift_group_right(sg, x, 1)
- sycl::select_from_group(sg, x, id)
- sycl::permute_group_by_xor(sg, x, mask)
- sycl::group_broadcast(sg, x, id)
- sycl::reduce_over_group(sg, x, op)
- sycl::exclusive_can_over_group(sg, x, op)
- sycl::inclusive_can_over_group(sg, x, op)
- sycl::any_off_group(sg, x)
- sycl::all_off_group(sg, x)
- sycl::none_off_group(sg, x)

Example of sub-group group algorithm:

h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group_by();
    auto i = item.get_global_id(0);

    /*Group Algorithm*/
    data[i] = reduce_over_group(sg, data[i], plus<>());
});

Some attribute function of work item:

get_local_id() returns the index(es) of the work-item within its work-group
get_local_range() returns the size(s) of work-group
get_global_id() returns the index(es) of the work-item within the global size
get_global_range() returns the global size
get_group_id() returns the index(es) of the work-group
get_group_range() returns the size(s) of work-group within the parent global size

These attributes return the id<n> data, thus needing the index to access(e.g. item.get_local_id()[0]), or passing the index to function as a parameter to fetch the one(e.g. item.get_local_id(0)). Example:

#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 8; // global size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
  q.submit([&](handler &h) {
    // need to initialize a stream in device to print, std::cout is not applicable for device
    auto cout = stream(1024, 768, h);
    h.parallel_for(nd_range<3>({N, N, N}, {2, 2, 2}), [=](nd_item<3> item) {
        auto local_id = item.get_local_id();
        auto local_range = item.get_local_range();
        auto global_id = item.get_global_id();
        auto global_range = item.get_global_range();
        auto group = item.get_group();
        auto group_range = item.get_group_range();
        // Specify the ones to be printed, otherwise randomly choose some
        if (item.get_group_linear_id() == 0) {
            cout << "Local 3D(" << local_id[0] << "," << local_id[1] << "," << local_id[2] << ") ";
            cout << "Local R(" << local_range[0] << "," << local_range[1] << "," << local_range[2] << ") ";
            cout << "Global 3D(" << global_id[0] << "," << global_id[1] << "," << global_id[2] << ") ";
            cout << "Global R(" << global_range[0] << "," << global_range[1] << "," << global_range[2] << ") ";
            cout << "Group 3D(" << group[0] << "," << group[1] << "," << group[2] << ") ";
            cout << "Group R(" << group_range[0] << "," << group_range[1] << "," << group_range[2] << ") ";
            cout << "Global index:" << global_id[0]+N*(global_id[1]+N*global_id[2]) << " ";
            cout << "LLid:" << item.get_local_linear_id() << ", GLid:" << item.get_global_linear_id() << ", GrpLid:" << item.get_group_linear_id() << "\n";
        }
    });
  });
}

-------->

## u162308 is compiling DPCPP_Essentials Module4 -- DPCPP Sub Groups - 1 of 7 sub_group_info.cpp
Device : Intel(R) UHD Graphics P630 [0x3e96]
Local 3D(0,0,0) Local R(2,2,2) Global 3D(0,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:0 LLid:0, GLid:0, GrpLid:0
Local 3D(0,0,1) Local R(2,2,2) Global 3D(0,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:64 LLid:1, GLid:1, GrpLid:0
Local 3D(0,1,0) Local R(2,2,2) Global 3D(0,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:8 LLid:2, GLid:8, GrpLid:0
Local 3D(0,1,1) Local R(2,2,2) Global 3D(0,1,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:72 LLid:3, GLid:9, GrpLid:0
Local 3D(1,0,0) Local R(2,2,2) Global 3D(1,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:1 LLid:4, GLid:64, GrpLid:0
Local 3D(1,0,1) Local R(2,2,2) Global 3D(1,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:65 LLid:5, GLid:65, GrpLid:0
Local 3D(1,1,0) Local R(2,2,2) Global 3D(1,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:9 LLid:6, GLid:72, GrpLid:0

But it is not the same as in sub-group, the functions below all return type id<1>, needing to retrieve by indexing ret[0]:

get_local_id() returns the id of work item in its sub-group.
get_local_range() returns the range of items in its sub-group.
get_group_id() returns the id of its sub-group in the work-group.
get_group_range() returns the range of sub-groups in the work-group.

Sub-group size will be chosen by compiler implicitly, but also can be assigned with supported size(related to hardware). Query the supported sizes by get_info:

auto sizes = q.get_device().get_info<info::device::sub_group_sizes>();

User can define the size used for sub group by adding [[intel::reqd_sub_group_size(8)]]:

q.submit([&](handler &h) {
    //# setup sycl stream class to print standard output from device code
    auto out = stream(1024, 768, h);

    //# nd-range kernel with user specified sub_group size
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(8)]] {
        //# get sub_group handle
        auto sg = item.get_sub_group();

        //# query sub_group and print sub_group info once per sub_group
        if (sg.get_local_id()[0] == 0) {
        out << "sub_group id: " << sg.get_group_id()[0] << " of "
            << sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
            << "\n";
        }
    });
}).wait();

How to implement the get_sub_group_linear_id() function? sub_group_linear_id = sub_group_id in work_group + sub_group_range * work_group_id

q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group();
    auto sg_linear_id = sg.get_group_id() + sg.get_group_range() * item.get_group()[0];
}).wait();

Reduction

Before introducing simplified reduction in SYCL 2020, a reduction in global size is implemented with 2 steps, reduction on work group and accumulation on work-group result:

q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto wg = item.get_group();
    auto i = item.get_global_id();

    int sum_wg = reduce_over_group(wg, data[i], plus<>());

    if (item.get_local_id(0) == 0) { // item local id in work group, stride = B
        data[i] = sum_wg;
    }
});

q.single_task([=]() {
    int sum = 0;
    for (int i = 0; i < N; i+=B) { // stride = B
        sum += data[i];
    }
    data[0] = sum;
});

After simplified reduction, we can use reduction object to make this thing:

queue q;
auto data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data[i] = i;
}

auto sum = malloc_shared<int>(1, q);
sum[0] = 0;

q.parallel_for(nd_range<1>(N, B),
               reduction(sum, plus<>()),
               [=](nd_item<1> item, auto& sum) {
    int i = item.get_global_id(0);
    sum += data[i];
}).wait();

std::cout << "Sum = " << sum[0] << std::endl;

free(data, q);
free(sum, q);

Multiple Reductions in one kernel:

  {
    //# create buffers
    buffer buf_data(data);
    buffer buf_sum(&sum, range(1));
    buffer buf_min(&min, range(1));
    buffer buf_max(&max, range(1));

    q.submit([&](handler& h) {
      //# create accessors for data and results
      accessor acc_data(buf_data, h, read_only);
        
      //# define reduction objects for sum, min, max reduction
      auto reduction_sum = reduction(buf_sum, h, plus<>());
      auto reduction_min = reduction(buf_min, h, minimum<>());
      auto reduction_max = reduction(buf_max, h, maximum<>());
      
      //# parallel_for with multiple reduction objects
      h.parallel_for(nd_range<1>{N, B}, 1.2, reduction_sum, reduction_min, reduction_max, [=](nd_item<1> it, auto& temp_sum, auto& temp_min, auto& temp_max) {
        auto i = it.get_global_id();
        temp_sum.combine(acc_data[i]);
        temp_min.combine(acc_data[i]);
        temp_max.combine(acc_data[i]);
      });
    });
  }

L4 Profiler

VTune Profiler

Intel® VTune™ Profiler Performance Analysis Cookbook: Link

Profiling types

hotspots
memory-consumption
uarch-exploration
memory-access
threading
hpc-performance
system-overview
graphic-rendering
io
fpga-interaction
gpu-offload
gpu-hotspots
throttling
platform-profiler
cpugpu-concurrency
tsx-exploration
tsx-hotspots
sgx-hotspots

Example:

#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh

type=gpu-hotspots
result_dir=vtune-result
program=a.out
args=xxx

# collect info
vtune -collect $type -result-dir $(result_dir) $(pwd)/$(program) $args

# report info
vtune -report summary -result-dir $result -format html -report-output $(pwd)/summary.html

# display report
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)

Advisor

Roofline

Command Lines to run roofline example:

#!/bin/bash

# To run the GPU Roofline analysis in the Intel® Advisor CLI:
  
# 1. Run the Survey analysis with the --enable-gpu-profiling option:   
advixe-cl –collect=survey --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 2. Run the Trip Counts and FLOP analysis with --enable-gpu-profiling option:   
advixe-cl -–collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 3. Generate a GPU Roofline report:    
advixe-cl --report=roofline --gpu --project-dir=./adv

# 4. Open the generated roofline.html in a web browser to visualize GPU performance.

Vectorization

Offload Modeling

The easiest way to run Offload Advisor is to use the batch mode that consists in running 2 scripts available is the folder $APM ($APM is available when Advisor is sourced).

collect.py: Used to collect data such as timing, flops, tripcounts and many more
analyze.py: Creating the report

To be more specific, collect.py runs the following analyses:

survey: Timing your application functions and loops, reading compiler diagnostics
tripcount: With flops and cache simulation to count the number of iterations in the loops as well as the number of operations and memory transfers
dependency: Check if you have data dependency in your loops, preventing it to be good candidates for offloading or vectorization

Offload Advisor is currently run from the command-line as below. Once the run is complete you can view the generated report.html.

#!/bin/bash

advixe-python $APM/collect.py advisor_project --config gen9 -- ./matrix.dpcpp
advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir ./analyze

L5 GPU Optimization Guide

Amdahl's Law: 并行加速比取决于可并行部分的比重。
- $acc = \frac{W_s+W_p}{W_s+\frac{W_p}{p}}$
Locality Matters: 加速卡的专用内存也有层级
- latency: register > cache > DRAM/HBM
- 三原则：
  - 尽可能在加速器上保留数据
  - 内核执行时访问连续的内存块
  - 重构代码以获得更高的数据重用

GPU Architecture

See reference.

Xe Tile
- 4HBM
- 64GB Memory
- Up to 4096 SMT threads
- 4 Xe Slice
  - 16 Xe Core (a.k.a Dual Sub Slice(DSS) or Sub Slice(SS))
    一次分配一个 work-group
    8 vector engine (a.k.a. Execution Unit(EU))
    执行 SIMD 指令，一个时钟周期执行 32 个数据的指令。
    一次分配一个 sub-group
    512 FP16 operations/cycle
    256 FP32 operations/cycle
    256 FP64 operations/cycle
    8 matrix engine
    8192 INT8 operations/cycle
    4096 BF16 operations/cycle
    2048 FP32 operations/cycle
    512K L1 cache/SLM
    512B/cycle load/store memory access bandwidth
本次 PAC 是内部版本（ATS-P），有两个 Tile，每个含 30 Xe Core，即 480个 EU。共 960 EU。
What is EU?
An Execution Unit (EU) is the smallest thread-level building block of the Intel® Iris® Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of a 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF) of 32B wide.

Tile -> Slice -> Core -> EU

Thread mapping and GPU utilization

See Reference

Thread context: 线程上下文，内核应该有足够多的线程来占满 GPU context
- EU 上可放线程数量：Xe-LP 7 thread, Xe-HPC 8 thread
SIMD unit: 程序应该向量化利用 SIMD 指令。
- Xe-LP 执行 Int 是 SIMD32。
- 在 ATS-P 上，FP32 和 FP64 都是 SIMD16。
SYCL 中限制 workgroup size 是 512。

Calculation: 以 Xe-LP(Low Power低功耗家用版本) 为例，其上有 6 个 sub-slice，每个 sub-slice 上有 16 个 EU，每个 EU 有 7 个 thread。下面的例子中 sub-group 大小为 8。

auto command_group =
    [&](auto &cgh) {
      cgh.parallel_for(nd_range(sycl::range(64, 64, 128), // global range
                                sycl::range(1, R, 128)    // local range
                                ),
                       [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(8)]] {
                         // (kernel code)
                         // Internal synchronization
                         item.barrier(access::fence_space::global_space);
                         // (kernel code)
                       })
    }

Work-items

Group Size

Threads

Xe-core Util

Xe-core Occu

64x64x128=524288

(R=1) 128

16/112=14%

100% with 7 work-groups

64x64x128=524288

(R=2) 128x2

2x16=32

32/112=28.6%

86% with 3 work-groups

64x64x128=524288

(R=3) 128x3

3x16=48

48/112=42.9%

86% with 2 work-groups

64x64x128=524288

(R=4) 128x4

4x16=64

64/112=57%

57% maximum

64x64x128=524288

(R=5) 640+

Fail to launch

Explanation: 当 group size = 128，每个 Sub-slice 每次分配到 128 个 item，因为 sub-group = 8，所以需要 128/8=16 个 EU 完成计算。每个 Sub-slice 上刚好有 16个 EU，因此每个 EU 只需要一个 Thread 即可完成，共用 16 Threads。所以 Thread 对应的 Xe-core Util 计算为: $\frac{16Thread}{7\frac{Threads}{EU}\times16EU}=\frac{1}{7}=14.2%$ 因为任务总量足够多(64x64x128=524288)，因此一次 Dispatch 可以放满所有的 Threads，Xe-core Occupation 为 100%。

当 group size = 512，每个 Sub-slice 每次分配到 512 个 item，因为 sub-group = 8，因此需要 512/8=64 个 EU 完成计算。每个 Sub-slice 上有 16 个 EU，因此每个 EU 需要压 64/16=4 个线程，共用 64 Threads。所以 Thread 对应的 Xe-core Util 计算为：$\frac{64Thread}{7\frac{Threads}{EU}\times16EU}=\frac{4}{7}=57%$ 即使任务总量足够多，但是每个 workgroup 都至少需要 EU 压上 4 个线程，对于只有 7 个 thread context 的 Xe-LP 而言，无法容纳第二个 workgroup，因此每次 Dispatch 只能分配 1个 workgroup，Xe-core Occupation 为 $\frac{4}{7}=57%$

Conclusion：因为 Dispatch 会有额外开销，所以尽量提高 Xe-Occupation 以实现最佳性能。

Exercise: ATS-P 有两个 Tile，每个 Tile 上 30 个 Xe-core，每个 Xe-core 上 16 个 EU，即 480 个 EU，共 960 个 EU。

在 ATS-P 上，假设总数据量是 96x512，数据类型是 FP64
- Q1: 当 workgroup size = 512 时，一个 workgroup 给 Xe-core 上分配几个线程？
- Q2: 一个 Xe-core 可以分配多少个 workgroup？
- Q3: 把全部数据分配到 ATS-P 上，需要几次 dispatch？
A1: 对于数据类型是 FP64 的可以使用 SIMD16，所以可以设置 sub-group size=16，每次可以执行 512/16=32 个线程。
A2: ATS-P 每个 Xe-core 上一共有 16*8=128 个线程，因此可以分配 128/32=4 个 work-group。
A3: 数据一共需要 96x512/512=96 个 work-group。每次每个 Xe-core 可以分配4个，一次可以分配 2*30*4=240 个 work-group。因此理论上一次可以分配完。

SLM (Shared Local Memory)

所有 work item 之间通信/共享：使用 global memory，带宽低延迟高。
一个 subgroup(EU) 里面的 work item 可以共享数据，但是范围太小。
一个 workgroup(Xe-core) 里可以共享数据，使用 SLM。 ATS-P 上有 64K 大小的 SLM。
SLM 是显式的调用，通过 sycl::malloc_shared，或者 accessor<..., target::local> 。
可以根据 work item 使用的空间大小确定一个 Xe-core 上可以放的 work group size。例如一个 work item 使用 512B 的数据的话，workgroup size最大为 65536/512=128。此时 thread context 无法使用满。
如何设置每个 work item 使用的数据大小来填满 thread context？
- Xe-core 有 16EU x 8 Threads = 128 Threads，每个 Thread 可以处理 SIMD16 的 FP32/FP64，因此设置 sub-group size 为 16，则可以处理 128x16=2048 个 work-item。 65536/2048=32B/work item。
SLM Bank Conflict: SLM 上的 Cache Line 也是 64 字节，被分为 16 个 Bank，每个 Bank 有 4 个字节。16 个 work item可以并行访问 16 个 Bank。但是访问同一 Bank 里面的不同地址会导致 Bank Conflict。

Multi-card usage

See Reference:Considerations for programming to multi-tile and multi-card under Level-Zero backend. Example of fetching all available devices in machine(here we only collect level-zero devices):

// Initialize the available queues
std::vector<queue> qs;
auto platforms = platform::get_platforms();
for (auto& p : platforms) {
        auto pp = p.get_info<info::platform::name>();
        scout << "Platform: " << pp << sendl;
        std::string level_zero = "Level-Zero";
        if (pp.substr(pp.size() - level_zero.size()).compare(level_zero) == 0) {
                auto devices = p.get_devices();
                for (auto& d : devices) {
                        scout << "Device: " << d.get_info<info::device::name>() << sendl;
                        if (d.is_gpu()) {
                                scout << "Device is GPU - adding to vector of queues" << sendl;
                                qs.push_back(queue(d));
                        }
                }
        }
}
assert(qs.size() == world_size);
queue q = qs[my_rank];

L6.1 Report

每一页写优化前后效果
使用的工具分析需要贴图
新设立 oneapi 奖项

L6.2 Optimization Examples

PAC2021 初赛优化样例

向量化 report

通过 -qopt-report=5 生成向量化报告，通过报告发现向量化不理想，于是对函数进行拆分，再继续通过 -qopt-zmm-usage=high 编译强制编译器使用高位宽指令集。

生成报告步骤：

icpx -fiopenmp -fopenmp-target=spir64 -qopt-report=5 foo.c 生成两个 yaml 文件，其中一个名为 foo.opt.yaml。
使用 oneapi 自带的 opt-viewer.py 生成 html 格式的文件：opt-viewer.py foo.opt.yaml 输出 foo.c.html
使用 web 浏览器可以打开查看报告。

MPI+OMP 混合并行

基于平台提供的众核的特点，使用混合并行，每个 MPI 程序每次处理一张图片，图片内部使用 OMP 进行并行，最后通过 Reduction 归约操作汇总到一个进程中写入结果文件。

GPU Offload 样例

对于需要在 GPU 上使用 openmp 的程序而言，在编译的时候需要加入 -fiopenmp -fopenmp-target=spir64 的选项。
对于 omp 难以调试的问题可以使用环境变量 LIBOMPTARGET_DEBUG=1/2 来开启 debug 级别 log。

OpenMP 制导语句中的 map 表示数据的传输方向（一维数组）：

to: copy to device
from: copy back to host
tofrom: copy to and back
alloc: allocate on device
release: release the data allocated on device

#pragma omp target map(to:x[0:n]) map(tofrom:y[0:n]) // 指定函数在 target device 上进行计算
#pragma omp parallel for
for (int i = 0; i <N; i++) {
    y[i] += a*x[i];
}

对于高维数组而言：

float *A[ny];
for (int j = 0; j < ny; j++) {
    A[j] = (float*)omp_target_alloc_device(nx*h*sizeof(float), deviceID);
    omp_target_memcpy(stack_recon[j], stack_recon_device[j], nx*h*sizeof(float), 0, 0, 0, deviceID);
}

FFT offloading

#pragma omp target data map(tofrom:x[0:N[0]*N[1]]) device(devNum)
{
    #pragma omp target variant dispatch use_device_ptr(x) device(devNum)
    {
        forward_plan = fftwf_plan_dft(2, N, x, x, FFTW_FORWARD, FFTW_ESTIMATE);
    }
    #pragma omp target variant dispatch device(devNum)
    {
        fftwf_execute(forward_plan);
    }
}

异步计算

通过 omp 的 nowait 可以实现派发任务到 target device 后，CPU 和 GPU 同时计算。通过 depend 实现依赖控制。

#pragma omp declare target // 指导以下代码都在 target 上使用 omp
#include <stdlib.h>
#include <omp.h>
extern void compute(float*, float*, int);
#pragma omp end declare target

void vec_mult_async(float* p, float* v1, float* v2, int N) {
    int i;
    
    #pragma omp target enter data map(alloc: v1[:N], v2[:N])
    
    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work_on_cpu(); // execute asyn on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
        #pragma omp distribute parallel for
        for (int i = 0; i < N; i++) {
            p[i] = v1[i] + v2[i];
        }
    }
    #pragma omp target exit data map(release: v1[:N], v2[:N])
}

Others

Using DPC++ with MPI

See reference.

export I_MPI_CXX=dpcpp
mpiicpc -fsycl -std=c++17 -lsycl -ltbb main.cpp -o dpc_reduce

The End

PreviousPAC 2022 NextOpenMP & MPI

Last updated 2 years ago

Was this helpful?