Work Sharing Directives
Experiment
Without a critical region, there are data races between threads reading and writing the shared variable pi. Inside the critical region, threads execute serially.
#pragma omp parallel spawns a team of threads, #pragma omp for splits the iterations of a for loop among them, and #pragma omp parallel for combines the two.
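A minimal sketch of such an experiment (the midpoint-rule pi integration and the variable names are assumptions, not necessarily the exact lab code):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    const double w = 1.0 / N;   /* width of each strip */
    double pi = 0.0;            /* shared accumulator */

    #pragma omp parallel
    {
        double local = 0.0;     /* per-thread partial sum */
        #pragma omp for
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * w;
            local += 4.0 / (1.0 + x * x);
        }
        /* Without this critical region, the read-modify-write of the
           shared variable pi would race between threads. */
        #pragma omp critical
        pi += local * w;
    }
    printf("pi = %.10f\n", pi);
    return 0;
}
```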
Lecture
Parallel do/for loops:
A parallel do/for loop divides up the iterations of the loop between threads.
There is an implicit synchronisation point at the end of the loop: all threads must finish their iterations before any thread can proceed (unless the nowait clause is used).
The loop must have a determinable trip count, since the OpenMP directive is translated at compile time.
for (var = a; var logical-op b; incr-exp)
where logical-op is one of <, <=, >, >=, and incr-exp is var = var +/- incr or a semantic equivalent such as var++ or var--.
var cannot be modified within the loop body.
Jumps out of the loop are not permitted.
How can you tell if a loop is parallel or not?
Useful test: if the loop gives the same answers when run in reverse order, it is almost certainly parallel.
In other words, there should be no data dependency between threads.
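For instance, the first loop below carries a dependency between iterations and fails the reverse-order test, while the second does not (the arrays and the function name are hypothetical):

```c
#include <omp.h>

void example(int n, double *a, const double *b) {
    /* NOT parallel: iteration i reads a[i-1], which is written by
       iteration i-1, so running the loop in reverse changes the answer. */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];

    /* Parallel: each iteration touches only its own element, so the
       iteration order does not matter. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}
```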
Clauses:
PRIVATE(var): var is private to each thread and uninitialised on entry, i.e. its value is indeterminate. (The loop index is PRIVATE by default.)
FIRSTPRIVATE(var): each thread's copy of var is initialised with the value var had before parallelisation.
LASTPRIVATE(var): the value var is assigned in the sequentially last iteration is carried back into the serial part.
REDUCTION(reduction-op:val): each thread accumulates a private copy of val with reduction-op (e.g. +, *, max, min), and the copies are combined at the end; see the sketch after the scheduling notes below.
schedule(kind[, chunksize]):
STATIC: divide the iteration space into equal chunks and assign them to threads in order.
DYNAMIC: divide the iteration space into chunks and assign them to threads on a first-come-first-served basis.
GUIDED: similar to DYNAMIC, but the chunks start off large and shrink exponentially: the size of the next chunk is proportional to the number of remaining iterations divided by the number of threads (i.e. remaining/threads). The chunksize specifies the minimum chunk size.
AUTO: lets the runtime have full freedom to choose its own assignment of iterations to threads. If the parallel loop is executed many times, the runtime can evolve a good schedule which has good load balance and low overheads.
RUNTIME: defer the choice of schedule to run time, when it is determined by the value of the environment variable OMP_SCHEDULE, e.g. export OMP_SCHEDULE="guided,4". Note that it is illegal to specify a chunksize in the code with the RUNTIME schedule.
chunksize: if not specified, the iteration space is divided into approximately equal chunks and one chunk is assigned to each thread in order (block schedule); if specified, the iteration space is divided into chunks of chunksize iterations each, and the chunks are assigned to threads cyclically in order (block cyclic schedule).
Which to use?
STATIC: best for load balanced loops - least overhead.
STATIC, n: good for loops with mild or smooth load imbalance, but can add overhead. (Handing out several iterations at a time mitigates load imbalance statistically.)
DYNAMIC: useful if iterations have widely varying loads, but ruins data locality.
GUIDED: often less expensive than DYNAMIC, but beware of loops where the first iterations are the most expensive! (The thread that takes the large first chunk will carry most of the load.)
AUTO: may be useful if the loop is executed many times over.
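A minimal sketch combining a reduction with an explicit schedule (the loop body and the chunk size of 8 are assumptions chosen for illustration):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000;
    double sum = 0.0;
    double last = 0.0;

    /* reduction(+:sum) gives each thread a private copy of sum and
       combines the copies at the end; lastprivate(last) carries the
       value from the sequentially last iteration (i == N-1) back into
       the serial part; schedule(dynamic, 8) hands out chunks of 8
       iterations on a first-come-first-served basis. */
    #pragma omp parallel for reduction(+:sum) lastprivate(last) \
                             schedule(dynamic, 8)
    for (int i = 0; i < N; i++) {
        double x = i * 0.001;
        sum += x * x;
        last = x;
    }

    printf("sum = %f, last = %f\n", sum, last);
    return 0;
}
```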
Nested rectangular loops:
We can use collapse(num_loops) to parallelise multiple loops: the loops are fused into a single loop whose length is the product of the individual loop lengths, and that combined loop is parallelised.
Useful when the outermost loop length N alone is too small to give every thread work, as sketched below.
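A minimal sketch (the array shape and loop body are assumptions):

```c
#include <omp.h>

#define N 4        /* outer trip count: fewer than typical thread counts */
#define M 1000

void scale(double a[N][M]) {
    /* collapse(2) fuses the i and j loops into one loop of N*M
       iterations, so all threads get work even though N alone is
       smaller than the thread count. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] *= 2.0;
}
```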
Some Tips on Using Nested Parallelism from Sun Studio
Nesting parallel regions provides an immediate way to allow more threads to participate in the computation.
For example, suppose you have a program that contains two levels of parallelism and the degree of parallelism at each level is 2. Also, suppose your system has four CPUs and you want to use all four CPUs to speed up the execution of this program. Parallelizing only one level would use only two CPUs, so you want to parallelize both levels.
Nesting parallel regions can easily create too many threads and oversubscribe the system. Set SUNW_MP_MAX_POOL_THREADS and SUNW_MP_MAX_NESTED_LEVELS appropriately to limit the number of threads in use and prevent runaway oversubscription.
Creating nested parallel regions adds overhead. If there is enough parallelism at the outer level and the load is balanced, it is generally more efficient to use all the threads at the outer level of the computation than to create nested parallel regions at the inner levels.
For example, suppose you have a program that contains two levels of parallelism. The degree of parallelism at the outer level is 4 and the load is balanced. You have a system with four CPUs and want to use all four CPUs to speed up the execution of this program. Then, in general, using all 4 threads for the outer level could yield better performance than using 2 threads for the outer parallel region, and using the other 2 threads as slave threads for the inner parallel regions.
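A minimal sketch of two nested levels with 2 threads each (the num_threads values are assumptions matching the four-CPU example; the Sun-specific environment variables above are the vendor's own controls):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_max_active_levels(2);   /* allow two active levels of parallelism */

    #pragma omp parallel num_threads(2)        /* outer level: 2 threads */
    {
        #pragma omp parallel num_threads(2)    /* inner level: 2 threads each */
        {
            printf("outer thread %d, inner thread %d\n",
                   omp_get_ancestor_thread_num(1),  /* thread id at nesting level 1 */
                   omp_get_thread_num());
        }
    }
    return 0;
}
```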
SINGLE directive:
SINGLE indicates that a block of code is to be executed by a single thread only. The first thread reaching the SINGLE directive executes the block.
There is an implicit synchronisation point at the end of the block: all other threads wait until the block has been executed.
Like the for directive, SINGLE can also take PRIVATE and FIRSTPRIVATE.
The directive must contain a structured block: branching into or out of it is not permitted.
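A minimal sketch (the printed messages are placeholders):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Executed by whichever thread arrives first; the others wait
           at the implicit barrier at the end of the single block. */
        #pragma omp single
        printf("setup done by thread %d\n", omp_get_thread_num());

        /* All threads resume from here after the barrier. */
        printf("thread %d working\n", omp_get_thread_num());
    }
    return 0;
}
```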
MASTER directive:
MASTER indicates that only the thread with index 0 executes the block.
There is NO synchronisation point at the end of the block: other threads skip the block and continue executing (unlike SINGLE).
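A minimal sketch (the messages are placeholders):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Only thread 0 executes this block; there is no barrier
           afterwards, so the other threads race ahead immediately. */
        #pragma omp master
        printf("thread 0 reporting progress\n");

        printf("thread %d continues without waiting\n",
               omp_get_thread_num());
    }
    return 0;
}
```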
SECTIONS directive:
Allow separate blocks of code to be executed in parallel (e.g. several independent subroutines)
There is a synchronisation point at the end of the blocks: all threads must finish their blocks before any thread can proceed.
Not scalable: the source code determines the amount of parallelism available.
Rarely used, except with nested parallelism.
It can also take PRIVATE, FIRSTPRIVATE and LASTPRIVATE.
Each section must contain a structured block: branching into or out of a section is not permitted.
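A minimal sketch with two independent routines (the routine names are hypothetical):

```c
#include <stdio.h>
#include <omp.h>

void init_fields(void)    { printf("initialising fields\n"); }
void init_particles(void) { printf("initialising particles\n"); }

int main(void) {
    #pragma omp parallel
    {
        /* Each section is executed exactly once, by some thread, so the
           two independent routines can run concurrently. All threads
           synchronise at the end of the sections construct. */
        #pragma omp sections
        {
            #pragma omp section
            init_fields();

            #pragma omp section
            init_particles();
        }
    }
    return 0;
}
```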
WORKSHARE directive: (Only available in FORTRAN)
A kind of implicit work sharing that can be replaced by an explicit do directive; its use is not encouraged.
The workshare construct divides the execution of the enclosed structured block into separate units of work, and causes the threads of the team to share the work such that each unit is executed only once by one thread, in the context of its implicit task.
No Schedule clause: distribution of work units to threads is entirely up to the compiler!
There is a synchronisation point at the end of workshare: all threads must finish their work before any thread can proceed.
No function calls except array intrinsics and those declared ELEMENTAL