Tensor Program Abstraction

Credit to: mlc.ai by Tianqi Chen

Before anything else, here is a way to pretty-print a TVMScript module:

import IPython

# MyModule is any TVMScript IRModule; Code renders its script with syntax highlighting.
IPython.display.Code(MyModule.script(), language="python")

Primitive Tensor Function

[figure: primitive tensor function]

A primitive tensor function can be seen as a self-contained unit that takes input data, performs its computation, and outputs the expected result. Linear, Add, and ReLU can each be seen as a primitive tensor function; fused functions such as linear_add or add_relu can also be seen as primitive tensor functions.

A primitive tensor function does not restrict its implementation. Take add as an example: we can call torch.add, write a vanilla add in plain Python, or even write a parallelized add with the aid of OpenMP.
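
As a quick illustration, here is a sketch of two equivalent add implementations (the 1-D shapes and dtypes are arbitrary assumptions); a parallelized C/OpenMP variant would implement the same contract:

import numpy as np
import torch

def add_torch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # delegate to the library kernel
    return torch.add(a, b)

def add_python(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # vanilla element-wise loop over a 1-D array
    c = np.empty_like(a)
    for i in range(a.shape[0]):
        c[i] = a[i] + b[i]
    return c

# An OpenMP version would be, roughly:
#   #pragma omp parallel for
#   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];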

Example of Tensor Program Abstraction (TPA)

A typical tensor program abstraction contains several parts: buffers, loop nests, and computation statements.
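
A minimal TVMScript sketch of a 128-element add shows all three parts (the exact T.Buffer annotation syntax varies slightly across TVM versions); it also plays the role of the MyModule referenced in the snippets above and below:

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"),   # buffers: two inputs and one output
             B: T.Buffer((128,), "float32"),
             C: T.Buffer((128,), "float32")):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):                   # loop nest
            with T.block("C"):                 # the computation statement lives in a block
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]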

Essential classes

A schedule is a set of transformations that change the order of computation but preserve the semantics of computation.

TensorIR

Do2Learn

An IRModule has func::script to output the decorated functions as TVMScript.

Staying in the IRModule is not enough; if you want to transform the original code, create a class::Schedule (tvm.tir.Schedule) and apply the transformations through it.

With a Schedule, the code can be equipped with optimizations and then built and run.

For example, we can retrieve pieces of the code from sch using the func::get_xxx family (func::get_block, func::get_loops).

Once we have the loops, we can do many fancy things with them, e.g. unroll, vectorize, or parallelize them, as sketched below.
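
A sketch of this flow on the MyModule add sketched earlier (block "C" in function main); split/parallel/vectorize are tvm.tir.Schedule primitives, and the split factor 4 is arbitrary:

import tvm

sch = tvm.tir.Schedule(MyModule)           # class::Schedule holds a copy of the IRModule
block_C = sch.get_block("C")               # retrieve the block by its name
(i,) = sch.get_loops(block_C)              # retrieve the loops around the block
i0, i1 = sch.split(i, factors=[None, 4])   # split i into an outer and an inner loop
sch.parallel(i0)                           # run the outer loop in parallel
sch.vectorize(i1)                          # vectorize the inner loop
print(sch.mod.script())                    # sch.mod reflects every transformation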

As we can see, the schedule holds a copy of the IRModule under the name mod; every transformation changes what sch.mod.script() prints.

There can be more transformations:
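
For instance, a fresh schedule on the same module can combine a multi-way split with reorder and unroll (a sketch; the factors are arbitrary):

import tvm

sch2 = tvm.tir.Schedule(MyModule)
block_C = sch2.get_block("C")
(i,) = sch2.get_loops(block_C)
i0, i1, i2 = sch2.split(i, factors=[None, 8, 4])  # multi-way split
sch2.reorder(i0, i2, i1)                          # permute the loop order
sch2.unroll(i2)                                   # unroll one of the loops
print(sch2.mod.script())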

From the example above, we can draw a couple of observations:

  1. The computation part is wrapped inside a block, like block C in the example.

  2. Every optimization of the original script needs a concrete schedule to carry it out. (I have not found a reverse operation so far.)

build and run

After we make some changes, we can build the module into a runnable function using tvm:
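
A sketch, continuing from the schedule sch above (target "llvm" for CPU; the data shapes match the 128-element add):

import numpy as np
import tvm

rt_mod = tvm.build(sch.mod, target="llvm")   # compile the transformed IRModule
func = rt_mod["main"]                        # look the function up by its global_symbol

a_nd = tvm.nd.array(np.random.rand(128).astype("float32"))
b_nd = tvm.nd.array(np.random.rand(128).astype("float32"))
c_nd = tvm.nd.empty((128,), dtype="float32")

func(a_nd, b_nd, c_nd)                       # run the compiled kernel
np.testing.assert_allclose(c_nd.numpy(), a_nd.numpy() + b_nd.numpy(), rtol=1e-5)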

To evaluate the performance of an optimization, tvm provides an evaluator:
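
A sketch using time_evaluator on the module built above:

# rt_mod, a_nd, b_nd, c_nd come from the build-and-run sketch above.
timer = rt_mod.time_evaluator("main", tvm.cpu(), number=10)
print("mean run time: %g s" % timer(a_nd, b_nd, c_nd).mean)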

Case Study on MM_Relu

Low-level numpy implementation:
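
A sketch of the reference implementation, following the mlc.ai course and assuming 128x128 float32 matrices (Y = A @ B, then C = relu(Y)):

import numpy as np

def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j in range(128):
            for k in range(128):
                if k == 0:
                    Y[i, j] = 0
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)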

The differences between low-level numpy and IRModule are:

  1. variable types: tvm treats all tensors as buffers, whether they are parameters or temporary variables.

  2. for loops: T.grid is syntactic sugar for nested loops.

  3. blocks: a block is a basic unit of computation in TensorIR; the computation parts are wrapped in annotated blocks like "Y" and "C", which can be retrieved with sch.get_block.

  4. extra information: the more information we provide to the compiler, the more the compiler can do for us. For example, T.axis.spatial and T.axis.reduce respectively mark axes whose iteration order does not matter and axes that perform a reduction (see the TensorIR sketch after this list).
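
For comparison, here is a TensorIR counterpart of lnumpy_mm_relu (a TVMScript sketch; the module name MyModuleMM is an assumption, and the Buffer annotation syntax varies slightly across TVM versions):

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModuleMM:
    @T.prim_func
    def mm_relu(A: T.Buffer((128, 128), "float32"),
                B: T.Buffer((128, 128), "float32"),
                C: T.Buffer((128, 128), "float32")):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype="float32")   # temporary buffer
        for i, j, k in T.grid(128, 128, 128):             # sugar for three nested loops
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))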

Block Axis properties

Notably, for a fixed value of vi and vj, the computation block produces a point value at a spatial location of Y (Y[vi, vj]) that is independent of the other locations in Y (those with different vi, vj values). We call vi and vj spatial axes, as they directly correspond to the beginning of the spatial region of the buffer that the block writes to. The axes involved in the reduction (vk) are called reduce axes.

The extra axis information helps us validate correctness. For example, if an axis expects an extent of 128 but is bound to a for loop of range(127), an exception is raised. Besides, the extra information lets the compiler do more fancy things based on the axes' dependencies and relevance.

Sugars

The initialization of the axes can be simplified as:
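
Concretely, inside block "Y" of the mm_relu sketch, the three separate T.axis.spatial / T.axis.reduce bindings can be written in one line with T.axis.remap, where "S" marks a spatial axis and "R" a reduce axis:

# replaces the three binding lines for vi, vj, vk inside the block
vi, vj, vk = T.axis.remap("SSR", [i, j, k])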

Function Annotations

Here are two attributes:

  • global_symbol: the name of the function, unique within the IRModule; it is used to retrieve the function from the built module.

  • tir.noalias: indicates that the buffer memories do not overlap (see the snippet after this list).
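
A brief sketch of both attributes in use, assuming the MyModuleMM sketch from the case study:

import tvm

# The prim_func declares: T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
rt_mod = tvm.build(MyModuleMM, target="llvm")
mm_relu_fn = rt_mod["mm_relu"]   # retrieved by its global_symbol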

Transformation

Suppose we apply a transformation that makes fuller use of the cache when reading matrix B, written in low-level numpy like this:
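
A low-level numpy sketch (following the mlc.ai course): the j loop is split into j0 and j1 and the k loop moves between them, so consecutive j1 iterations read adjacent B[k, j] elements that sit on the same cache lines.

import numpy as np

def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j0 in range(32):                 # j is split into j0 (outer) and j1 (inner)
            for k in range(128):             # k is moved between j0 and j1
                for j1 in range(4):
                    j = j0 * 4 + j1
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)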

Why is the cache better utilized? See the figure: [figure: tensor func loop order].

In this case, we need to:

  1. split the j loop into two parts

  2. reorder the inner part of the j loop with the k loop (see the schedule sketch after this list)
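
A sketch of these two steps with the schedule API, assuming MyModuleMM from the case study (the split factor 4 matches the numpy sketch above):

import tvm

sch = tvm.tir.Schedule(MyModuleMM)
block_Y = sch.get_block("Y", func_name="mm_relu")
i, j, k = sch.get_loops(block_Y)
j0, j1 = sch.split(j, factors=[None, 4])   # 1. split j into j0 (outer) and j1 (inner)
sch.reorder(j0, k, j1)                     # 2. move k between j0 and j1
print(sch.mod.script())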

What's more, we can see that block C and block Y share part of their loops, so we can place them under the same loops.
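
Continuing the schedule sketch above, reverse_compute_at moves block "C" under loop j0 of block "Y", so the two blocks share the i and j0 loops:

block_C = sch.get_block("C", func_name="mm_relu")
sch.reverse_compute_at(block_C, j0)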

Last, we can separate the initialization part of Y from its reduction update using func::decompose_reduction:
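
Continuing the same sketch, decompose_reduction lifts the initialization Y[vi, vj] = 0 out of the reduction loop over k:

sch.decompose_reduction(block_Y, k)
print(sch.mod.script())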

Another way to create and interact with TensorIR

Intro to Tensor Expression

Tensor expression (te) is a domain-specific language that describes a sequence of computations via an expression-like API.

Here, te.compute takes the signature te.compute(output_shape, fcompute), where fcompute (the lambda function in the example) describes how we want to compute the value of each element Y[i, j] for a given index.

In this particular case, we want to create a function with two input parameters (A, B) and one output parameter (C).
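
A sketch of the whole flow for mm_relu in te (following the case study; MyModuleFromTE is an assumed name):

import tvm
from tvm import te

A = te.placeholder((128, 128), "float32", name="A")
B = te.placeholder((128, 128), "float32", name="B")
k = te.reduce_axis((0, 128), name="k")
Y = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="Y")
C = te.compute((128, 128), lambda i, j: te.max(Y[i, j], 0), name="C")

# Lower the dataflow (inputs A, B and output C) into a TensorIR function.
te_func = te.create_prim_func([A, B, C]).with_attr({"global_symbol": "mm_relu"})
MyModuleFromTE = tvm.IRModule({"mm_relu": te_func})
print(MyModuleFromTE.script())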

The tensor expression API provides a helpful tool to generate TensorIR functions for a given higher-level input.
