# Lectures

## L1 Intro

### 硬件平台

* 计算节点：
  * 计算节点：高性能刀片或者机架服务器
  * 胖服务器节点：SMP架构高性能服务器，具备多CPU和大内存容量
  * GPU计算节点：服务器中安装GPGPU卡，使用GPU进行运算加速
* 三平面组网：
  * 数据网络：用于存储或者数据传输
  * 计算网络：用于计算过程中的消息传递
  * 管理网络：用于集群系统管理

### HPC典型业务负载和应用

![](/files/0OkRH2Q2txQP9dgObdHp)

### Intel 特性

* AI量化
* Sapphire Rapids: 提供 AMX 矩阵运算指令集，提供超AVX512 16倍的性能
* GPU: Ponte Vecchio
  * 128X Cores
  * 64MB L1 cache + 408MB L2 Cache + 128GB HBM2e
* Optane 持久内存
* DAOS 文件存储格式
  * 极高IO带宽和IOPS
  * 克服给予块和POSIX接口的性能瓶颈，支持HPC/AI/BIG DATA多种业务接口
* oneAPI 统一的软件堆栈
  * Low-Level Libraries:
    * oneMKL: Math Kernel Library
    * oneDNN: Deep Neural Network
    * oneDTL: Data Transform Library
    * oneIPL: Image Process Library
    * oneVPL: Video Process Library
    * oneTBB: Threading Building Blocks
    * oneCCL: Collective Communication Library
    * oneDPL: DPC++ Library
    * etc
  * Languages:
    * DPC++
    * SYCL
    * etc
  * DPCT: Intel DPC++ Compatibility Tool
    * 协助开发者完成一次性代码迁移，从CUDA到DPC++
  * CFlags:
    * Profile guided optimization(multi-step build):
      * `-prof-gen`: generation
      * `-prof-use`: usage
    * `-fast`: `-ipo -O3 -no-prec-div -static -fp-mode fast`
    * `-fiopenmp -fopenmp-targets=spir64`: offloading openmp on gpu device
    * `-O2` vs `-O3`: in some scenarios, `O2` performs better than `O3`
    * `-mprefer-vector-width=512`: LLVM cflag, indicating the preference to AVX512
    * `-ipo`: multi-file inter-procedural optimization, aiming at the potential optimization in caller/callee chain
    * `-pad`: padding the array to rearrange the memory distribution to make full use of cache
  * Compilers:

| Driver |    Target    | OpenMP support | OpenMP offload Support |
| :----: | :----------: | :------------: | :--------------------: |
|   icc  |      CPU     |        Y       |            N           |
|  dpcpp | CPU,GPU,FPGA |        N       |            N           |
|   icx  |    CPU,GPU   |        Y       |            Y           |
|  ifort |      CPU     |        Y       |            N           |
|   ifx  |    CPU,GPU   |        Y       |            Y           |

## L2.1 DPC++

### Intel Advisor

* Offload Advisor: 预估卸载至加速器的性能
* Roofline Analysis: 优化用于内存和计算的 CPU/GPU 代码
* Vectorization Advisor: 添加和优化矢量化
* Threading Advisor: 为非线程化应用添加有效的线程化功能
* 流图分析器：高校创建和分析流图

### DPC++

```cpp
#include <CL/sycl.cpp>
constexpr int N = 16;
using namespace sycl;

int main() {
    // On host machine
    queue q;
    int *data = malloc_shared<int>(N, q); // Allocating on USM(unified shared memory)
    
    // On accelerator
    q.parallel_for(N, [=](auto i) {
        data[i] = i;
    }).wait();

    // On host machine
    for (int i = 0; i < N; i++) {
        std::cout << data[i] << std::endl;
    }
    free(data, q);
}
```

#### Device Selector

`device_selector` supports selecting specific devices in runtime, executing the kernel with information provided by user heuristically.\
Example:

```cpp
default_selector d_selector;
host_selector h_selector;
cpu_selector c_selector;
gpu_selector g_selector;
queue q(d_selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
```

#### queue

* Queue is the execution of submitting commands to SYCL runtime.
* Queue is the mechanism of submitting tasks to devices.
* Queue:Device is M:1 relationship.

```cpp
queue q;
q.submit([&](handler& h) {
    // COMMAND GROUP CODE
});
```

#### kernel

* Kernel encapsulates the data and function when executing in device.
* Kernel is not constructed explicitly by user.
* Kernel is constructed when calling kernel scheduling functions like `parallel_for`.

The lambda function submitted in queue is called **command group**.

```cpp
q.submit([&](handler& h) {
    h.parallel_for(range<1>(N), [=](id<1> i) {
        A[i] = B[i] + C[i];
    });
});
```

#### Basic Parallel Kernel

* `range`: describe the parallel execution iteration space
* `id`: index a single instance in parallel execution
* `item`: a single instance, can be used to query the range and other attributes

```cpp
// CPU Code
for (int i = 0; i < 1024; i++) {
    // CODE
}

// Parallel Code
h.parallel_for(range<1>(1024), [=](id<1> idx) {
    // CODE
});

h.parallel_for(range<1>(1024), [=](item<1> item) {
    auto idx = item.get_id();
    auto R = item.get_range();
    // CODE
});
```

#### ND-Range Kernel

* `ND-Range` 内核是另一种表示并行性的方法，通过提供对本地内存的访问以及将执行映射到硬件上的计算单元实现底层性能调整。
* `nd_range` 将整个迭代空间划分为多个较小的**工作组**，工作组中的工作被安排在硬件的单个计算单元上。
* 讲内核执行分组到工作组中，可以控制工作分配中的资源使用和负载均衡。
* `nd_range` 的第一个参数表示整个迭代空间的大小(`global_size`)，第二个参数表示工作组的大小(`work-group size`)。

```cpp
h.parallel_for(nd_range<1>(range<1>(1024) /*global size*/,
                           range<1>(64))  /*work-group size*/,
                           [=](nd_item<1> item) { /*actual work task*/
    auto idx = item.get_global_id;
    auto local_id = item.get_local_id();
    // CODE
});
```

#### 缓冲区内存模式

* 缓冲区表示数据的**所有权**，不表示数据具体的**存储位置**
* 缓冲区：将数据封装在 SYCL 应用中
* 访问器：用于实际访问缓冲区中的数据
  * 在 SYCL 图中创建数据依赖关系，以在内核执行中进行排序

对于下面的例子而言，`accessor` 与 `handler` 做绑定，`handler` 与 `queue` 绑定，决定所在的设备，于是 `accessor` 提供到对应设备的访问能力。

```cpp
queue q;
std::vector<int> v(N, 10);
{
    buffer buf(v);
    q.submit([&](handler& h) {
        accessor a(buf, h, write_only);
        h.parallel_for(N, [=](auto i) { a[i] = i; });
    });
}
for (int i = 0; i < N; i++) {
    std::cout << v[i] << " ";
}
```

### 异步执行

SYCL 应用分为两个部分：

* 主机代码
* 内核执行图

除了同步操作，这些执行都是互相独立的

* 主机代码提交工作(`q.submit`)以构建`graph`图（同时自身可以执行计算操作）
* 内核执行和数据移动的`graph`图由SYCL运行时管理与主机代码异步zhixing

若存在多个内核，SYCL会根据`accessor`自动解决数据依赖：

```cpp
int main() {
    auto R = range<1>(num);
    buffer<int> A{R}, B{R};
    queue q;

    // Kernel 1
    q.submit([&](auto& h) {
        accessor out(A, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 2
    q.submit([&](auto& h) {
        accessor out(A, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 3
    q.submit([&](auto& h) {
        accessor out(B, h, write_only);
        h.parallel_for(R, [=](id<1> i) {
            out[i] = i;
        });
    });
    // Kernel 4
    q.submit([&](auto& h) {
        accessor in(A, h, read_only);
        accessor inout(B, h);
        h.parallel_for(R, [=](id<1> i) {
            inout[i] *= in[i];
        });
    });

}
```

数据依赖图可以分析得到：

```mermaid
graph TD
Start -->|A| 1 -->|A| 2 -->|A| 4 --> End
Start -->|B| 3 -->|B| 4 
```

### 同步操作

#### 主机访存器

主机访问设备上的数据通过创建 `host_accessor` 实现。\
因为数据可以还在计算过程中，所以主机访存器的创建是一个阻塞的同步操作，只有等到内核均完成执行并且数据可以通过主机访存器提供给主机后才返回。

```cpp
#include <CL/sycl.hpp>
using namespace sycl;
constexpr int N = 16;

int main() {
    std::vector<double> v(N, 100);
    queue q;

    buffer buf(v); // 缓冲区掌握着存储在矢量中的数据
    q.submit([&](auto& h) {
        accessor a(buf, h);
        h.parallel_for(N, [=](auto i) {
            a[i] -= 2;
        });
    });

    host_accessor b(buf, read_only); // 阻塞请求
    for (int i = 0; i < N; i++) {
        std::cout << b[i] << std::endl;
    }
    return 0;
}
```

#### 缓冲区销毁

```cpp
#include <CL/sycl.hpp>
using namespace sycl;
constexpr int N = 16;

void dpcpp_code(std::vector<double>& v, queue& q) {
    buffer buf(v); // 当缓冲区的创建在单独的函数范围内进行
    q.submit([&](auto& h) {
        accessor a(buf, h);
        h.parallel_for(N, [=](auto i) {
            a[i] -= 2;
        });
    });
}

int main() {
    std::vector<double> v(N, 10);
    queue q;
    dpcpp_code(v, q); 
    // 完成该函数的调用后会调用缓冲区的销毁函数(析构函数)
    // 缓冲区将放弃数据的所有权，将数据拷贝回主机内存
    // 因此主机可以直接访问 v 上计算完成的数据
    for (int i = 0; i < N; i++) {
        std::cout << v[i] << std::endl;
    }
    return 0;
}
```

### Example

```cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"
#include "Complex.hpp"

using namespace sycl;
using namespace std;

// Number of complex numbers passing to the SYCL code
static const int num_elements = 10000;

class CustomDeviceSelector : public device_selector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) const override {
    int device_rating = 0;
    //We are querying for the custom device specific to a Vendor and if it is a GPU device we
    //are giving the highest rating as 3 . The second preference is given to any GPU device and the third preference is given to
    //CPU device. 
    if (dev.is_gpu() & (dev.get_info<info::device::name>().find(vendorName_) !=
                        std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  };

 private:
  std::string vendorName_;
};

// in_vect1 and in_vect2 are the vectors with num_elements complex nubers and
// are inputs to the parallel function
void DpcppParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){ 
    std::cout << "ERROR: Vector sizes do not  match"<< "\n";
    return;
  }
  // Setup input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);

  // Setup Output buffers 
  buffer bufout_vect(out_vect);

  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit Command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set as read mode
    accessor V1(bufin_vect1,h,read_only);
    accessor V2(bufin_vect2,h,read_only);
    // Accessor set to Write mode
    accessor V3 (bufout_vect,h,write_only);
    h.parallel_for(R, [=](auto i) {
      V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}
void DpcppScalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){
    std::cout<<"ERROR: Vector sizes do not match"<<"\n";
    return;
    }
  for (int i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}
// Compare the results of the two output vectors from parallel and scalar. They
// should be equal
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if(v1.size() != v2.size()){
    ret_code = -1;
  }
  for (int i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}
int main() {
  // Declare your Input and Output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;

  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }

  // Initialize your Input and Output Vectors. Inputs are initialized as below.
  // Outputs are initialized with 0
  try {
    // Pass in the name of the vendor for which the device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    // queue constructor passed exception handler
    CustomDeviceSelector selector(vendor_name);
    queue q(selector, dpc_common::exception_handler);
    // Call the DpcppParallel with the required inputs and outputs
    DpcppParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // some other exception detected
    std::cout << "Failure" << "\n";
    std::terminate();
  }

  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the Parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  for (int i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the DpcppScalar function with the required input and outputs
  DpcppScalar(input_vect1, input_vect2, out_vect_scalar);

  // Compare the outputs from the parallel and the scalar functions. They should
  // be equal

  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else
    std::cout
        << "*********************************************Verification Failed. Results are "
           "not matched**************************"
        << "\n";

  return 0;
}
```

## L2.2 oneMKL

运行库链接方式：查询[onemkl link line advisor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html)

Contacts: <mkl.tces@intel.com>

### USM (Unified Shared Memory)

> 类似于 CUDA 中的 Unified Memory。

* 使用 Buffer API：buf 规定了数据的所有权，用户无需关心函数执行结果

```cpp
using mkl::blas::gemm;
int64_t n = 32;

device dev({host, cpu, gpu}_selector());
queue Q(dev);

size_t bytes = n*n*sizeof(double);
double* A = ...;
double* B = ...;
double* C = ...;

buffer<double, 1> A_buf{A, range<1>(n*n)};
buffer<double, 1> B_buf{B, range<1>(n*n)};
buffer<double, 1> C_buf{C, range<1>(n*n)};

gemm(Q, mkl::transpose::N, mkl::transpose::N,
     n, n, n, 1.0, A_buf, n, B_buf, n, 0.0, C_buf, n);
```

* 使用 USM API：用户定义了数据的共享存在方式，需要对任务完成与否负责

```cpp
using mkl::blas::gemm;
int64_t n = 32;

device dev({host, cpu, gpu}_selector());
queue Q(dev);

size_t bytes = n*n*sizeof(double);
double* A = malloc_shared(bytes, dev, Q.get_context());
double* B = malloc_shared(bytes, dev, Q.get_context());
double* C = malloc_shared(bytes, dev, Q.get_context());

gemm(Q, mkl::transpose::N, mkl::transpose::N,
     n, n, n, 1.0, A, n, B, n, 0.0, C, n);

Q.wait_and_throw(); // 等待任务对数据的写入结束
```

## L3.1 Compiler Guidance

* Frequently used commands:
  * `sycl-ls`: listing available devices
  * `export SYCL_DEVICE_FILTER={CPU|GPU|FPGA}`
* Helpful links
  * [Get Started with the Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-dpcpp-compiler/top.html)
  * [Get Started with OpenMP\* Offload to GPU for the Intel® oneAPI DPC/C++ Compiler and Intel® Fortran Compiler](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-cpp-fortran-compiler-openmp/top.html)
  * [Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top.html)

## L3.2 SYCL Features

### USM

#### Setup

USM is bind to specific queue, not accessible to other queue.

```cpp
int *data = malloc_shared<int>(N, q);
// or
int *data = static_cast<int*>(malloc_shared(N*sizeof(int), q));
```

#### Usage

Before using USM, the SYCL provides the buffer method:

```cpp
queue q;

// allocating data on host
int *data = static_cast<int*>(malloc(N*sizeof(int), q));

// initialize on host
for (int i = 0; i < n; i++) {
    data[i] = 10;
}

// create a scope for auto destroy
{
    buffer<int, 1> buf(data, range<1>(N));
    q.submit([&](handler& h) {
        // need the help of accessor to access data on device
        auto A = buf.get_access<access::mode::read_write>(h);
        h.parallel_for(range<1>(N), [=](id<1> i) {
            A[i] += 1;
        });
    });
} // auto buffer destruction, updating data on host from device

// accessible on host again after auto buffer destruction
for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}

// destroy data on USM
free(data);
```

After using USM, developer does not need to care the data flow:

```cpp
queue q;

// allocating data on USM
auto data = malloc_shared<int>(N, q); // specify queue

// initialize on host
for (int i = 0; i < n; i++) {
    data[i] = 10;
}

// accessible on device
q.parallel_for(N, [=](auto i) {
    data[i] += 1;
}).wait; // Need to add an explicit wait for the data transfer

// accessible on host again after wait finished
for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}

// destroy data on USM
free(data, q); // specify queue
```

#### Ways

* `sycl::malloc_device`: allocate memory on device, no accessible to host, need explicit data copy to access on host.
* `sycl::malloc_host`: allocate memory on host, accessible to device via `Direct Memory Access`.
* `sycl::malloc_shared`: allocate memory which can migrate between device and host by program automatically and implicitly. (Need `wait` to ensure the only access from host or device)

Example:

```cpp
queue q;

int data[N];
for (int i = 0; i < N; i++) {
    data[i] == 10;
}

// allocate on device
int *data_device = malloc_device<int>(N, q);

// memory copy from host to device
q.memcpy(data_device, data, sizeof(int)*N).wait()

// access data on device
q.parallel_for(N, [=](auto i) {
    data_device[i] += 1;
}).wait();

// memory copy back from device to host
q.memcpy(data, data_device, sizeof(int)*N).wait();

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}
```

#### data dependency

* `in_order` feature will make all task serialized, avoiding the data race.

Example:

```cpp
queue q{property::queue::in_order()};

int *data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data[i] = 10;
}

q.parallel_for(N, [=](auto i) {
    data[i] += 2;
});
q.parallel_for(N, [=](auto i) {
    data[i] += 3;
});
q.parallel_for(N, [=](auto i) {
    data[i] += 4;
});

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}
free(data, q);
```

* `depends_on` can avoid making the situation the extreme by serializing all tasks.

Example:

```cpp
queue q;

int *data1 = malloc_shared<int>(N, q);
int *data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data1[i] = 10;
    data2[i] = 10;
}

auto e1 = q.parallel_for(N, [=](auto i) {
    data1[i] += 2;
});

auto e2 = q.parallel_for(N, [=](auto i) {
    data2[i] += 3;
});
q.submit([&](handler& h) {
    h.depends_on({e1, e2});
    h.parallel_for(N, [=](auto i) {
        data1[i] += data2[i];
    });
}).wait();

for (int i = 0; i < N; i++) {
    std::cout << data[i] << std::endl;
}
free(data1, q);
free(data2, q);
```

### Sub Groups

See [Reference](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels/sub-group.html).\
![hierarchy](/files/L2TeYnEjZiPjsMtolDnt)

> A sub-group is a collection of contiguous work-items in the global index space that **execute in the same VE thread**. When the device compiler compiles the kernel, multiple work-items are packed into a sub-group by vectorization so the generated SIMD instruction stream can perform tasks of multiple work-items simultaneously. Properly partitioning work-items into sub-groups can make a big performance difference.

* `Sub-group` 是**同时执行**的或**具有额外调度保证**的工作项 `work-item` 的子集。
* 利用 `Sub-group` 有助于将执行映射到低级硬件，并可以帮助实现更高的性能。

![Intel Graphic Gen11 Architecture](/files/iRfYZgKq9NxUwB2gFegG)

* 工作组 `work-group` 中的所有工作会被安排到一个子切片 `SubSlice` 中，该子切片拥有自己的本地内存 `SubSlice Local Memory`，比访问 L3 Cache 快很多。
* `Sub-group` 中的所有工作项都在单个 `Execution Units` 线程上执行。
* `Sub-group` 中的每个工作项都映射到 `SIMD` 通道。

Why Sub-group?

* `work-item` in `Sub-group` can communicate with each other using `shuffle operations`, avoiding accessing local or global memory, providing higher performance.
* `work-item` in `Sub-group` can access `sub-group group algorithm` which implements frequently used parallel strategies.
  * `sycl::shift_group_left(sg, x, 1)`
  * `sycl::shift_group_right(sg, x, 1)`
  * `sycl::select_from_group(sg, x, id)`
  * `sycl::permute_group_by_xor(sg, x, mask)`
  * `sycl::group_broadcast(sg, x, id)`
  * `sycl::reduce_over_group(sg, x, op)`
  * `sycl::exclusive_can_over_group(sg, x, op)`
  * `sycl::inclusive_can_over_group(sg, x, op)`
  * `sycl::any_off_group(sg, x)`
  * `sycl::all_off_group(sg, x)`
  * `sycl::none_off_group(sg, x)`

Example of `sub-group group algorithm`:

```cpp
h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group_by();
    auto i = item.get_global_id(0);

    /*Group Algorithm*/
    data[i] = reduce_over_group(sg, data[i], plus<>());
});
```

Some attribute function of work item:

* `get_local_id()` returns the index(es) of the work-item within its work-group
* `get_local_range()` returns the size(s) of work-group
* `get_global_id()` returns the index(es) of the work-item within the global size
* `get_global_range()` returns the global size
* `get_group_id()` returns the index(es) of the work-group
* `get_group_range()` returns the size(s) of work-group within the parent global size

These attributes return the `id<n>` data, thus needing the index to access(e.g. `item.get_local_id()[0]`), or passing the index to function as a parameter to fetch the one(e.g. `item.get_local_id(0)`).\
Example:

```cpp
#include <CL/sycl.hpp>
using namespace sycl;

static constexpr size_t N = 8; // global size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << "\n";
  q.submit([&](handler &h) {
    // need to initialize a stream in device to print, std::cout is not applicable for device
    auto cout = stream(1024, 768, h);
    h.parallel_for(nd_range<3>({N, N, N}, {2, 2, 2}), [=](nd_item<3> item) {
        auto local_id = item.get_local_id();
        auto local_range = item.get_local_range();
        auto global_id = item.get_global_id();
        auto global_range = item.get_global_range();
        auto group = item.get_group();
        auto group_range = item.get_group_range();
        // Specify the ones to be printed, otherwise randomly choose some
        if (item.get_group_linear_id() == 0) {
            cout << "Local 3D(" << local_id[0] << "," << local_id[1] << "," << local_id[2] << ") ";
            cout << "Local R(" << local_range[0] << "," << local_range[1] << "," << local_range[2] << ") ";
            cout << "Global 3D(" << global_id[0] << "," << global_id[1] << "," << global_id[2] << ") ";
            cout << "Global R(" << global_range[0] << "," << global_range[1] << "," << global_range[2] << ") ";
            cout << "Group 3D(" << group[0] << "," << group[1] << "," << group[2] << ") ";
            cout << "Group R(" << group_range[0] << "," << group_range[1] << "," << group_range[2] << ") ";
            cout << "Global index:" << global_id[0]+N*(global_id[1]+N*global_id[2]) << " ";
            cout << "LLid:" << item.get_local_linear_id() << ", GLid:" << item.get_global_linear_id() << ", GrpLid:" << item.get_group_linear_id() << "\n";
        }
    });
  });
}

-------->

## u162308 is compiling DPCPP_Essentials Module4 -- DPCPP Sub Groups - 1 of 7 sub_group_info.cpp
Device : Intel(R) UHD Graphics P630 [0x3e96]
Local 3D(0,0,0) Local R(2,2,2) Global 3D(0,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:0 LLid:0, GLid:0, GrpLid:0
Local 3D(0,0,1) Local R(2,2,2) Global 3D(0,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:64 LLid:1, GLid:1, GrpLid:0
Local 3D(0,1,0) Local R(2,2,2) Global 3D(0,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:8 LLid:2, GLid:8, GrpLid:0
Local 3D(0,1,1) Local R(2,2,2) Global 3D(0,1,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:72 LLid:3, GLid:9, GrpLid:0
Local 3D(1,0,0) Local R(2,2,2) Global 3D(1,0,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:1 LLid:4, GLid:64, GrpLid:0
Local 3D(1,0,1) Local R(2,2,2) Global 3D(1,0,1) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:65 LLid:5, GLid:65, GrpLid:0
Local 3D(1,1,0) Local R(2,2,2) Global 3D(1,1,0) Global R(8,8,8) Group 3D(0,0,0) Group R(4,4,4) Global index:9 LLid:6, GLid:72, GrpLid:0
```

But it is not the same as in sub-group, the functions below all return type `id<1>`, needing to retrieve by indexing `ret[0]`:

* `get_local_id()` returns the id of work item in its sub-group.
* `get_local_range()` returns the range of items in its sub-group.
* `get_group_id()` returns the id of its sub-group in the work-group.
* `get_group_range()` returns the range of sub-groups in the work-group.

`Sub-group` size will be chosen by compiler implicitly, but also can be assigned with supported size(related to hardware). Query the supported sizes by `get_info`:

```cpp
auto sizes = q.get_device().get_info<info::device::sub_group_sizes>();
```

User can define the size used for sub group by adding `[[intel::reqd_sub_group_size(8)]]`:

```cpp
q.submit([&](handler &h) {
    //# setup sycl stream class to print standard output from device code
    auto out = stream(1024, 768, h);

    //# nd-range kernel with user specified sub_group size
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item)[[intel::reqd_sub_group_size(8)]] {
        //# get sub_group handle
        auto sg = item.get_sub_group();

        //# query sub_group and print sub_group info once per sub_group
        if (sg.get_local_id()[0] == 0) {
        out << "sub_group id: " << sg.get_group_id()[0] << " of "
            << sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
            << "\n";
        }
    });
}).wait();
```

How to implement the `get_sub_group_linear_id()` function?\
`sub_group_linear_id = sub_group_id in work_group + sub_group_range * work_group_id`

```cpp
q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto sg = item.get_sub_group();
    auto sg_linear_id = sg.get_group_id() + sg.get_group_range() * item.get_group()[0];
}).wait();
```

#### Reduction

Before introducing simplified reduction in SYCL 2020, a reduction in global size is implemented with 2 steps, reduction on work group and accumulation on work-group result:

```cpp
q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    auto wg = item.get_group();
    auto i = item.get_global_id();

    int sum_wg = reduce_over_group(wg, data[i], plus<>());

    if (item.get_local_id(0) == 0) { // item local id in work group, stride = B
        data[i] = sum_wg;
    }
});

q.single_task([=]() {
    int sum = 0;
    for (int i = 0; i < N; i+=B) { // stride = B
        sum += data[i];
    }
    data[0] = sum;
});
```

After simplified reduction, we can use **`reduction object`** to make this thing:

```cpp
queue q;
auto data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) {
    data[i] = i;
}

auto sum = malloc_shared<int>(1, q);
sum[0] = 0;

q.parallel_for(nd_range<1>(N, B),
               reduction(sum, plus<>()),
               [=](nd_item<1> item, auto& sum) {
    int i = item.get_global_id(0);
    sum += data[i];
}).wait();

std::cout << "Sum = " << sum[0] << std::endl;

free(data, q);
free(sum, q);
```

Multiple Reductions in one kernel:

```cpp
  {
    //# create buffers
    buffer buf_data(data);
    buffer buf_sum(&sum, range(1));
    buffer buf_min(&min, range(1));
    buffer buf_max(&max, range(1));

    q.submit([&](handler& h) {
      //# create accessors for data and results
      accessor acc_data(buf_data, h, read_only);
        
      //# define reduction objects for sum, min, max reduction
      auto reduction_sum = reduction(buf_sum, h, plus<>());
      auto reduction_min = reduction(buf_min, h, minimum<>());
      auto reduction_max = reduction(buf_max, h, maximum<>());
      
      //# parallel_for with multiple reduction objects
      h.parallel_for(nd_range<1>{N, B}, 1.2, reduction_sum, reduction_min, reduction_max, [=](nd_item<1> it, auto& temp_sum, auto& temp_min, auto& temp_max) {
        auto i = it.get_global_id();
        temp_sum.combine(acc_data[i]);
        temp_min.combine(acc_data[i]);
        temp_max.combine(acc_data[i]);
      });
    });
  }
```

## L4 Profiler

### VTune Profiler

Intel® VTune™ Profiler Performance Analysis Cookbook: [Link](https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top.html)

![](/files/kP11fka5LjTUok26buY1)

#### Profiling types

* hotspots
* memory-consumption
* uarch-exploration
* memory-access
* threading
* hpc-performance
* system-overview
* graphic-rendering
* io
* fpga-interaction
* gpu-offload
* gpu-hotspots
* throttling
* platform-profiler
* cpugpu-concurrency
* tsx-exploration
* tsx-hotspots
* sgx-hotspots

Example:

```bash
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh

type=gpu-hotspots
result_dir=vtune-result
program=a.out
args=xxx

# collect info
vtune -collect $type -result-dir $(result_dir) $(pwd)/$(program) $args

# report info
vtune -report summary -result-dir $result -format html -report-output $(pwd)/summary.html
```

```py
# display report
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)
```

### Advisor

#### Roofline

Command Lines to run roofline example:

```bash
#!/bin/bash

# To run the GPU Roofline analysis in the Intel® Advisor CLI:
  
# 1. Run the Survey analysis with the --enable-gpu-profiling option:   
advixe-cl –collect=survey --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 2. Run the Trip Counts and FLOP analysis with --enable-gpu-profiling option:   
advixe-cl -–collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=./adv -- ./matrix.dpcpp

# 3. Generate a GPU Roofline report:    
advixe-cl --report=roofline --gpu --project-dir=./adv

# 4. Open the generated roofline.html in a web browser to visualize GPU performance.    
```

![](/files/Cix6kjzb88RjPKoW0p7Y)

#### Vectorization

![](/files/BmukCFFOg3HIG39LajtN)

#### Offload Modeling

The easiest way to run Offload Advisor is to use the batch mode that consists in running 2 scripts available is the folder $APM ($APM is available when Advisor is sourced).

* collect.py: Used to collect data such as timing, flops, tripcounts and many more
* analyze.py: Creating the report

To be more specific, collect.py runs the following analyses:

* survey: Timing your application functions and loops, reading compiler diagnostics
* tripcount: With flops and cache simulation to count the number of iterations in the loops as well as the number of operations and memory transfers
* dependency: Check if you have data dependency in your loops, preventing it to be good candidates for offloading or vectorization

Offload Advisor is currently run from the command-line as below. Once the run is complete you can view the generated report.html.

```bash
#!/bin/bash

advixe-python $APM/collect.py advisor_project --config gen9 -- ./matrix.dpcpp
advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir ./analyze
```

![](/files/j5PbAbISXACBK4AaBU2K)

## L5 GPU Optimization Guide

* `Amdahl's Law`: 并行加速比取决于可并行部分的比重。
  * $acc = \frac{W\_s+W\_p}{W\_s+\frac{W\_p}{p}}$
* `Locality Matters`: 加速卡的专用内存也有层级
  * latency: register > cache > DRAM/HBM
  * 三原则：
    * 尽可能在加速器上保留数据
    * 内核执行时访问连续的内存块
    * 重构代码以获得更高的数据重用

#### GPU Architecture

See [reference](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html).

* Xe Tile
  * 4HBM
  * 64GB Memory
  * Up to 4096 SMT threads
  * 4 Xe Slice
    * 16 Xe Core (a.k.a Dual Sub Slice(DSS) or Sub Slice(SS))
      * **一次分配一个 work-group**
      * 8 vector engine (a.k.a. Execution Unit(EU))
        * 执行 SIMD 指令，一个时钟周期执行 32 个数据的指令。
        * **一次分配一个 sub-group**
        * 512 FP16 operations/cycle
        * 256 FP32 operations/cycle
        * 256 FP64 operations/cycle
      * 8 matrix engine
        * 8192 INT8 operations/cycle
        * 4096 BF16 operations/cycle
        * 2048 FP32 operations/cycle
      * 512K L1 cache/SLM
      * 512B/cycle load/store memory access bandwidth
* 本次 PAC 是内部版本（ATS-P），有两个 Tile，每个含 30 Xe Core，即 480个 EU。共 960 EU。
* What is EU?

  > An Execution Unit (EU) is the smallest thread-level building block of the Intel® Iris® Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of a 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF) of 32B wide.

Tile -> Slice -> Core -> EU ![](/files/9H2fNxTnLlQXXO8kkoms) ![](/files/TALPbnGReVagvcbL0FS6) ![](/files/W5aLg7qdLjzYgnYhBRLf) ![](/files/3OMnq2XHZGO3PLsE1fAF)

#### Thread mapping and GPU utilization

See [Reference](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/thread-mapping.html)

* Thread context: 线程上下文，内核应该有足够多的线程来占满 GPU context
  * EU 上可放线程数量：Xe-LP 7 thread, Xe-HPC 8 thread
* SIMD unit: 程序应该向量化利用 SIMD 指令。
  * Xe-LP 执行 Int 是 SIMD32。
  * 在 ATS-P 上，FP32 和 FP64 都是 SIMD16。
* SYCL 中限制 workgroup size 是 512。

**Calculation**: 以 Xe-LP(Low Power低功耗家用版本) 为例，其上有 6 个 sub-slice，每个 sub-slice 上有 16 个 EU，每个 EU 有 7 个 thread。下面的例子中 sub-group 大小为 8。

```cpp
auto command_group =
    [&](auto &cgh) {
      cgh.parallel_for(nd_range(sycl::range(64, 64, 128), // global range
                                sycl::range(1, R, 128)    // local range
                                ),
                       [=](sycl::nd_item<3> item) [[intel::reqd_sub_group_size(8)]] {
                         // (kernel code)
                         // Internal synchronization
                         item.barrier(access::fence_space::global_space);
                         // (kernel code)
                       })
    }
```

|    Work-items    |  Group Size | Threads | Xe-core Util |       Xe-core Occu      |
| :--------------: | :---------: | :-----: | :----------: | :---------------------: |
| 64x64x128=524288 |  (R=1) 128  |    16   |  16/112=14%  | 100% with 7 work-groups |
| 64x64x128=524288 | (R=2) 128x2 | 2x16=32 | 32/112=28.6% |  86% with 3 work-groups |
| 64x64x128=524288 | (R=3) 128x3 | 3x16=48 | 48/112=42.9% |  86% with 2 work-groups |
| 64x64x128=524288 | (R=4) 128x4 | 4x16=64 |  64/112=57%  |       57% maximum       |
| 64x64x128=524288 |  (R=5) 640+ |         |              |      Fail to launch     |

**Explanation**: 当 group size = 128，每个 Sub-slice 每次分配到 128 个 item，因为 sub-group = 8，所以需要 128/8=16 个 EU 完成计算。每个 Sub-slice 上刚好有 16个 EU，因此每个 EU 只需要一个 Thread 即可完成，共用 16 Threads。\
所以 Thread 对应的 Xe-core Util 计算为: $\frac{16Thread}{7\frac{Threads}{EU}\times16EU}=\frac{1}{7}=14.2%$\
因为任务总量足够多(64x64x128=524288)，因此一次 Dispatch 可以放满所有的 Threads，Xe-core Occupation 为 100%。\
![](/files/APF8I49xwD724vJ4lDco)

当 group size = 512，每个 Sub-slice 每次分配到 512 个 item，因为 sub-group = 8，因此需要 512/8=64 个 EU 完成计算。每个 Sub-slice 上有 16 个 EU，因此每个 EU 需要压 64/16=4 个线程，共用 64 Threads。 所以 Thread 对应的 Xe-core Util 计算为：$\frac{64Thread}{7\frac{Threads}{EU}\times16EU}=\frac{4}{7}=57%$\
即使任务总量足够多，但是每个 workgroup 都至少需要 EU 压上 4 个线程，对于只有 7 个 thread context 的 Xe-LP 而言，无法容纳第二个 workgroup，因此每次 Dispatch 只能分配 1个 workgroup，Xe-core Occupation 为 $\frac{4}{7}=57%$ ![](/files/1P0uEYHU2N49AZWEp1zR)

**Conclusion**： 因为 Dispatch 会有额外开销，所以尽量提高 Xe-Occupation 以实现最佳性能。

**Exercise**: ATS-P 有两个 Tile，每个 Tile 上 30 个 Xe-core，每个 Xe-core 上 16 个 EU，即 480 个 EU，共 960 个 EU。

* 在 ATS-P 上，假设总数据量是 96x512，数据类型是 FP64
  * Q1: 当 workgroup size = 512 时，一个 workgroup 给 Xe-core 上分配几个线程？
  * Q2: 一个 Xe-core 可以分配多少个 workgroup？
  * Q3: 把全部数据分配到 ATS-P 上，需要几次 dispatch？
* A1: 对于数据类型是 FP64 的可以使用 SIMD16，所以可以设置 sub-group size=16，每次可以执行 512/16=32 个线程。
* A2: ATS-P 每个 Xe-core 上一共有 16\*8=128 个线程，因此可以分配 128/32=4 个 work-group。
* A3: 数据一共需要 96x512/512=96 个 work-group。每次每个 Xe-core 可以分配4个，一次可以分配 2\*30\*4=240 个 work-group。因此理论上一次可以分配完。

#### SLM (Shared Local Memory)

* 所有 work item 之间通信/共享：使用 global memory，带宽低延迟高。
* 一个 subgroup(EU) 里面的 work item 可以共享数据，但是范围太小。
* 一个 workgroup(Xe-core) 里可以共享数据，使用 SLM。 ATS-P 上有 64K 大小的 SLM。
* SLM 是显式的调用，通过 `sycl::malloc_shared`，或者 `accessor<..., target::local>` 。
* 可以根据 work item 使用的空间大小确定一个 Xe-core 上可以放的 work group size。例如一个 work item 使用 512B 的数据的话，workgroup size最大为 65536/512=128。此时 thread context 无法使用满。
* 如何设置每个 work item 使用的数据大小来填满 thread context？
  * Xe-core 有 16EU x 8 Threads = 128 Threads，每个 Thread 可以处理 SIMD16 的 FP32/FP64，因此设置 sub-group size 为 16，则可以处理 128x16=2048 个 work-item。 65536/2048=32B/work item。
* SLM Bank Conflict: SLM 上的 Cache Line 也是 64 字节，被分为 16 个 Bank，每个 Bank 有 4 个字节。16 个 work item可以并行访问 16 个 Bank。但是访问同一 Bank 里面的不同地址会导致 Bank Conflict。

#### Multi-card usage

See Reference:[Considerations for programming to multi-tile and multi-card under Level-Zero backend](https://intel.github.io/llvm-docs/MultiTileCardWithLevelZero.html).\
Example of fetching all available devices in machine(here we only collect level-zero devices):

```cpp
// Initialize the available queues
std::vector<queue> qs;
auto platforms = platform::get_platforms();
for (auto& p : platforms) {
        auto pp = p.get_info<info::platform::name>();
        scout << "Platform: " << pp << sendl;
        std::string level_zero = "Level-Zero";
        if (pp.substr(pp.size() - level_zero.size()).compare(level_zero) == 0) {
                auto devices = p.get_devices();
                for (auto& d : devices) {
                        scout << "Device: " << d.get_info<info::device::name>() << sendl;
                        if (d.is_gpu()) {
                                scout << "Device is GPU - adding to vector of queues" << sendl;
                                qs.push_back(queue(d));
                        }
                }
        }
}
assert(qs.size() == world_size);
queue q = qs[my_rank];
```

## L6.1 Report

1. 每一页写优化前后效果
2. 使用的工具分析需要贴图
3. 新设立 oneapi 奖项

![](/files/d0OkB8U9sv5C8QfvguCk)

## L6.2 Optimization Examples

### PAC2021 初赛优化样例

#### 向量化 report

通过 `-qopt-report=5` 生成向量化报告，通过报告发现向量化不理想，于是对函数进行拆分，再继续通过 `-qopt-zmm-usage=high` 编译强制编译器使用高位宽指令集。

生成报告步骤：

1. `icpx -fiopenmp -fopenmp-target=spir64 -qopt-report=5 foo.c` 生成两个 `yaml` 文件，其中一个名为 `foo.opt.yaml`。
2. 使用 `oneapi` 自带的 `opt-viewer.py` 生成 `html` 格式的文件：`opt-viewer.py foo.opt.yaml` 输出 `foo.c.html`
3. 使用 `web` 浏览器可以打开查看报告。

#### MPI+OMP 混合并行

基于平台提供的众核的特点，使用混合并行，每个 MPI 程序每次处理一张图片，图片内部使用 OMP 进行并行，最后通过 Reduction 归约操作汇总到一个进程中写入结果文件。

### GPU Offload 样例

* 对于需要在 GPU 上使用 `openmp` 的程序而言，在编译的时候需要加入 `-fiopenmp -fopenmp-target=spir64` 的选项。
* 对于 `omp` 难以调试的问题可以使用环境变量 `LIBOMPTARGET_DEBUG=1/2` 来开启 debug 级别 log。

OpenMP 制导语句中的 map 表示数据的传输方向（一维数组）：

1. `to`: copy to device
2. `from`: copy back to host
3. `tofrom`: copy to and back
4. `alloc`: allocate on device
5. `release`: release the data allocated on device

```cpp
#pragma omp target map(to:x[0:n]) map(tofrom:y[0:n]) // 指定函数在 target device 上进行计算
#pragma omp parallel for
for (int i = 0; i <N; i++) {
    y[i] += a*x[i];
}
```

对于高维数组而言：

```cpp
float *A[ny];
for (int j = 0; j < ny; j++) {
    A[j] = (float*)omp_target_alloc_device(nx*h*sizeof(float), deviceID);
    omp_target_memcpy(stack_recon[j], stack_recon_device[j], nx*h*sizeof(float), 0, 0, 0, deviceID);
} 
```

#### FFT offloading

```cpp
#pragma omp target data map(tofrom:x[0:N[0]*N[1]]) device(devNum)
{
    #pragma omp target variant dispatch use_device_ptr(x) device(devNum)
    {
        forward_plan = fftwf_plan_dft(2, N, x, x, FFTW_FORWARD, FFTW_ESTIMATE);
    }
    #pragma omp target variant dispatch device(devNum)
    {
        fftwf_execute(forward_plan);
    }
}
```

#### 异步计算

通过 `omp` 的 `nowait` 可以实现派发任务到 target device 后，CPU 和 GPU 同时计算。\
通过 `depend` 实现依赖控制。

```cpp
#pragma omp declare target // 指导以下代码都在 target 上使用 omp
#include <stdlib.h>
#include <omp.h>
extern void compute(float*, float*, int);
#pragma omp end declare target

void vec_mult_async(float* p, float* v1, float* v2, int N) {
    int i;
    
    #pragma omp target enter data map(alloc: v1[:N], v2[:N])
    
    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work_on_cpu(); // execute asyn on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
        #pragma omp distribute parallel for
        for (int i = 0; i < N; i++) {
            p[i] = v1[i] + v2[i];
        }
    }
    #pragma omp target exit data map(release: v1[:N], v2[:N])
}
```

## Others

### Using DPC++ with MPI

See [reference](https://www.intel.com/content/www/us/en/developer/articles/technical/compile-and-run-mpi-programs-using-dpcpp-language.html).

```bash
export I_MPI_CXX=dpcpp
mpiicpc -fsycl -std=c++17 -lsycl -ltbb main.cpp -o dpc_reduce
```

## **The End**


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://legacy.cookielau.com/archives/9-sc/2022/0-lectures.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
