Tensor Program Abstraction
Credit to: mlc.ai by Tianqi Chen
Before anything else, here is a way to pretty-print a TVMScript module:
import IPython
IPython.display.Code(MyModule.script(), language="python")
Primitive Tensor Function

A primitive tensor function can be seen as a self-contained unit responsible for taking the input data, applying the computation, and producing the expected result. Linear, Add, and ReLU can each be seen as a primitive tensor function, and fused functions such as linear_add or add_relu can also be seen as primitive tensor functions.
A primitive tensor function does not restrict its implementation. Take add as an example: we can call torch.add, write a vanilla add in Python, or even write a parallelized add with the aid of OpenMP.
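As a rough sketch, here are two interchangeable implementations of the same add primitive (the helper names add_torch and add_loop are made up for illustration):
import numpy as np
import torch

def add_torch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # delegate to a library kernel
    return torch.add(a, b)

def add_loop(a: np.ndarray, b: np.ndarray, c: np.ndarray):
    # vanilla element-wise add written by hand
    for i in range(a.shape[0]):
        c[i] = a[i] + b[i]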
Example of TPA
A typical tensor program abstraction contains several parts: buffers, loop nests, and computation statements.
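As a hedged sketch (TVMScript syntax differs slightly across TVM versions; older releases write T.Buffer[(128,), "float32"]), a vector add in TensorIR shows all three parts:
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"),   # buffers
             B: T.Buffer((128,), "float32"),
             C: T.Buffer((128,), "float32")):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):                   # loop nest
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]          # computation statement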
Essential classes
A schedule is a set of transformations that change the order of computation but preserve the semantics of computation.
TensorIR
Do2Learn
IRModule provides func::script to output the decorated functions.
Staying in an IRModule is not enough: if you want to transform the original code, instantiate a class::Schedule to apply the transformations.
With a Schedule, the code can be equipped with optimizations and then built and run.
For example, we can get handles to the blocks and loops of the code from sch using the func::get_xxx methods.
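A minimal sketch, assuming the vector-add MyModule above:
sch = tvm.tir.Schedule(MyModule)                 # create a schedule over a copy of the module
block_c = sch.get_block("C", func_name="main")   # retrieve the block named "C"
(i,) = sch.get_loops(block_c)                    # retrieve the loops surrounding the block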
Once we have the loop handles, we can do many fancy things with them, e.g. unroll, vectorize, or parallelize a loop.
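For example (the split factors are an assumption chosen so that 8 * 4 * 4 = 128):
i0, i1, i2 = sch.split(i, factors=[None, 4, 4])  # split the single loop into three
sch.parallel(i0)                                  # run the outer loop in parallel
sch.unroll(i1)                                    # unroll the middle loop
sch.vectorize(i2)                                 # vectorize the innermost loop
IPython.display.Code(sch.mod.script(), language="python")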
As we can see, the schedule holds a copy of the IRModule under the name mod; every transformation changes the result of sch.mod.script().
There can be more transformations.
From the example above, we can draw a few observations:
The computation part is bound inside a block, like the block C in the example.
Every optimization of the original script needs a concrete schedule to implement it. (I have not found a reverse operation so far.)
build and run
After we make some fancy changes, we can build the module into a runnable function using tvm:
To evaluate the performance of an optimization, tvm provides an evaluator:
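A sketch continuing from the vector-add snippets above (target "llvm" is assumed for CPU):
rt_lib = tvm.build(sch.mod, target="llvm")            # compile the transformed module
a_nd = tvm.nd.array(np.arange(128, dtype="float32"))
b_nd = tvm.nd.array(np.ones(128, dtype="float32"))
c_nd = tvm.nd.empty((128,), dtype="float32")
rt_lib["main"](a_nd, b_nd, c_nd)                      # look up the function by its global_symbol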
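For example, time_evaluator reports the mean running time over several runs:
f_timer = rt_lib.time_evaluator("main", tvm.cpu())
print("Time cost of sch.mod: %g s" % f_timer(a_nd, b_nd, c_nd).mean)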
Case Study on MM_Relu
Low-level numpy implementation:
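A sketch of mm_relu with explicit loops, assuming 128x128 float32 matrices:
def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j in range(128):
            for k in range(128):
                if k == 0:
                    Y[i, j] = 0
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)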
The differences between low-level numpy and IRModule are:
variable type: tvm treats all variables as buffers, whether they are parameters or temporary variables.
for loop: T.grid is syntactic sugar indicating nested loops.
block: a block is the basic unit of computation in TensorIR; the computation parts are wrapped in annotated blocks like "Y" and "C", which can be retrieved by sch.get_block.
extra information: the more information we provide to the compiler, the more the compiler can do for us. For example, T.axis.spatial and T.axis.reduce respectively mark axes whose iteration order does not matter and axes that carry a reduction.
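For comparison, here is a hedged sketch of the same mm_relu written as a TensorIR IRModule (the class name MyModuleMM is an assumption, and the exact TVMScript syntax varies across TVM versions):
@tvm.script.ir_module
class MyModuleMM:
    @T.prim_func
    def mm_relu(A: T.Buffer((128, 128), "float32"),
                B: T.Buffer((128, 128), "float32"),
                C: T.Buffer((128, 128), "float32")):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype="float32")
        for i, j, k in T.grid(128, 128, 128):    # T.grid: sugar for three nested loops
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))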
Block Axis properties
Notably, for a fixed value of vi and vj, the computation block produces a point value at a spatial location of Y (Y[vi, vj]) that is independent of other locations in Y (those with different vi, vj values). We can call vi and vj spatial axes, as they directly correspond to the beginning of a spatial region of the buffers that the block writes to. The axes involved in reduction (vk) are called reduce axes.
The extra axis information helps us validate correctness. For example, if an axis requires an iterator over 128 elements but is bound to a for loop of range(127), an exception is raised. Besides, the extra information lets the compiler do more fancy things based on the dependency and relevance of the axes.
Sugars
The initialization of the block axes can be simplified as:
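For example, the three explicit bindings in the mm_relu sketch above collapse into one line ("S" marks a spatial axis, "R" a reduce axis):
# instead of
vi = T.axis.spatial(128, i)
vj = T.axis.spatial(128, j)
vk = T.axis.reduce(128, k)
# we can write
vi, vj, vk = T.axis.remap("SSR", [i, j, k])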
Function Annotations
Here are two attributes:
global_symbol: the name of the function, unique within the IRModule; it is used when retrieving the function from the built module.
tir.noalias: indicates that none of the buffer memories overlap.
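In TVMScript, both attributes are declared at the top of the prim_func, as in the sketches above:
T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})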
Transformation
If we want to make better use of the cache when reading matrix B, we can apply the following transformation in low-level numpy:
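A sketch, assuming we read B in chunks of 4 consecutive columns (the split factor 4 is an arbitrary choice):
def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j0 in range(32):              # outer part of the split j loop
            for k in range(128):
                for j1 in range(4):       # inner part: 4 consecutive elements of B's row
                    j = j0 * 4 + j1
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)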
Why is the cache better utilized? See this picture:
In this case, we need to do two things (see the schedule sketch after this list):
split the j loop into two parts
reorder the inner j loop with the k loop
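A sketch of those two steps with schedule primitives, assuming the MyModuleMM sketch above:
sch = tvm.tir.Schedule(MyModuleMM)
block_Y = sch.get_block("Y", func_name="mm_relu")
i, j, k = sch.get_loops(block_Y)
j0, j1 = sch.split(j, factors=[None, 4])   # step 1: split j into j0 and j1
sch.reorder(j0, k, j1)                     # step 2: move k between the two j loops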
What's more, we can see that block C and block Y share part of their loops, so we can move block C under the loops it shares with block Y.
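Continuing the sketch, one way to do this is reverse_compute_at, which moves block C under the chosen loop of block Y:
block_C = sch.get_block("C", func_name="mm_relu")
sch.reverse_compute_at(block_C, j0)   # move block C under the shared j0 loop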
Last, we can separate the initialization part and the computation part of Y using func::decompose_reduction:
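Continuing the sketch:
sch.decompose_reduction(block_Y, k)   # split Y's initialization out of the k loop
IPython.display.Code(sch.mod.script(), language="python")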
Another way to create and interact with TensorIR
Intro to Tensor Expression
Tensor expression (te) is a domain-specific language that describes a sequence of computations via an expression-like API.
Here te.compute takes the signature te.compute(output_shape, fcompute), and the fcompute function (the lambda function in the example) describes how we want to compute the value of each element Y[i, j] for a given index.
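A hedged sketch of mm_relu written with te (variable names follow the case study above):
from tvm import te

A = te.placeholder((128, 128), "float32", name="A")
B = te.placeholder((128, 128), "float32", name="B")
k = te.reduce_axis((0, 128), "k")
Y = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="Y")
C = te.compute((128, 128), lambda i, j: te.max(Y[i, j], 0), name="C")

# turn the te description into a TensorIR function and wrap it into an IRModule
te_func = te.create_prim_func([A, B, C]).with_attr({"global_symbol": "mm_relu"})
MyModuleFromTE = tvm.IRModule({"mm_relu": te_func})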
In this particular case, we want to create a function with two input parameters (A, B) and one output parameter (C).
The tensor expression API provides a helpful tool to generate TensorIR functions for a given higher-level input.