# Memory Classification

2020 Jan 29th CookieLau

## Preface

![](https://2161500321-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0bSIkrSKJhpcDpbSbW%2Fsync%2F78f87c5460956460e8c165f4d6f5d6e24a319b01.png?generation=1589303190981401\&alt=media)

## Register

## Local Memory

## Shared Memory

See 014 Using Shared Memory for details.

## Global Memory

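Global memory performance hinges on coalescing, that is, merging accesses. Global memory is always accessed in contiguous segments whose size is a multiple of 32 bytes, so accesses that can be merged should be merged, because this improves effective bandwidth.

For example, when the threads of a warp access consecutive 4-byte values, the access to global memory might look as follows: everything from byte 96 to byte 224 is fetched, 32\*4=128 bytes in total, which is exactly four 32-byte segments. This is a coalesced access.

![](https://2161500321-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0bSIkrSKJhpcDpbSbW%2Fsync%2F7ceeeefc21f0c769f8efc0e0f33332b9400b9aaa.png?generation=1589303190858594\&alt=media)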

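If, on the other hand, the threads of a warp access data that is contiguous but whose total size is not a multiple of 32 bytes, the leftover part that does not fill a 32-byte segment **still** costs a full 32-byte fetch from global memory. Part of that fetch is wasted, and bandwidth is spent on bytes nobody uses.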

![](https://2161500321-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0bSIkrSKJhpcDpbSbW%2Fsync%2F4029652150068117497b366e08839c37a84ae961.png?generation=1589303190836594\&alt=media)
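The segment arithmetic above can be checked on the host with a few lines of plain C. This is only a sketch under the assumption stated in this section (global memory is fetched in whole 32-byte segments); the names `SEGMENT_BYTES`, `segments_touched`, and `wasted_bytes` are made up for illustration and are not part of any CUDA API.

```c
#include <assert.h>
#include <stddef.h>

#define SEGMENT_BYTES 32  /* assumed fetch granularity, as described above */

/* Number of 32-byte segments touched by a contiguous access of nbytes
   bytes that starts at byte offset start. */
size_t segments_touched(size_t start, size_t nbytes)
{
    assert(nbytes > 0);
    size_t first = start / SEGMENT_BYTES;                 /* first segment index */
    size_t last  = (start + nbytes - 1) / SEGMENT_BYTES;  /* last segment index  */
    return last - first + 1;
}

/* Bytes that are fetched but never used by the access. */
size_t wasted_bytes(size_t start, size_t nbytes)
{
    return segments_touched(start, nbytes) * SEGMENT_BYTES - nbytes;
}
```

With these helpers, `segments_touched(96, 128)` gives 4 and `wasted_bytes(96, 128)` gives 0, which matches the coalesced case of 32 threads reading 4 bytes each from byte 96. A 100-byte access from the same offset still touches 4 segments and wastes 28 bytes, and shifting the 128-byte access to start at byte 100 touches 5 segments and wastes 32 bytes.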

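The CUDA API itself pays close attention to alignment: memory allocated by cudaMalloc, for instance, is aligned to 256 bytes. Keep this in mind when accessing memory, for example by choosing a sensible thread block size: a multiple of the warp size, i.e. a multiple of 32.

One commonly recommended approach: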

```c
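int deviceId;
int numberOfSMs;

// Query the active device and its number of streaming multiprocessors (SMs).
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

size_t threadsPerBlock;
size_t numberOfBlocks;

threadsPerBlock = 256;              // a multiple of the warp size (32)
numberOfBlocks = 32 * numberOfSMs;  // scale the grid with the number of SMs

kernel<<<numberOfBlocks, threadsPerBlock>>>(...);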
```

In practice the exact size does not matter much; any multiple of 32 will do.
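The two rules of thumb in this section (block sizes in multiples of 32, and 256-byte-aligned allocations) can be expressed as small host-side helpers. The following is a minimal sketch in plain C; `round_up_to_warp` and `is_aligned` are hypothetical helper names, not CUDA functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32  /* threads per warp */

/* Round a requested block size up to the next multiple of WARP_SIZE. */
size_t round_up_to_warp(size_t threads)
{
    assert(threads > 0);
    return (threads + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
}

/* Return 1 if ptr lies on an alignment-byte boundary, 0 otherwise.
   alignment must be a power of two (e.g. 256). */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

For instance, `round_up_to_warp(100)` yields 128 while `round_up_to_warp(256)` leaves 256 unchanged, and `is_aligned(p, 256)` could be used to check a pointer returned by cudaMalloc, or a pointer offset into such an allocation, before relying on its alignment.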

## Constant Memory

## Texture Memory
