We have two ways to express parallel kernels in TornadoVM:

1. The Loop Parallel API, using the `@Parallel` and `@Reduce` annotations.
2. The Kernel Context API, using the `KernelContext` object.

You can use any of those in your program.

**Using Option 1**

For your use case, if we use option 1:

```java
void reduction(float[] input, @Reduce float[] output) {
    for (@Parallel int i = 0; i < input.length; i++) {
        output[0] += input[i];
    }
}
```

TornadoVM will generate a few kernels under the hood, since reductions in OpenCL/CUDA, etc. are not that trivial.
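For completeness, here is a minimal sketch of how this annotated kernel could be launched (the class name `MyReductions` and the array sizes are assumptions; the `TaskGraph`/execution-plan calls mirror the Option 2 snippet below):

```java
// Minimal sketch (assumed class name and sizes) for launching the annotated reduction
float[] input = new float[8192];
float[] output = new float[1];

TaskGraph taskGraph = new TaskGraph("s0") //
        .transferToDevice(DataTransferMode.EVERY_EXECUTION, input) //
        .task("t0", MyReductions::reduction, input, output) //
        .transferToHost(DataTransferMode.EVERY_EXECUTION, output);

TornadoExecutionPlan executor = new TornadoExecutionPlan(taskGraph.snapshot());
executor.execute();

// No GridScheduler is needed here: the TornadoVM runtime derives the
// reduction kernels from the @Parallel/@Reduce annotations.
float sum = output[0];
```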
**Using Option 2**

With this option, you are programming the parallel reduction directly from Java. Note that this code is not semantically correct if you run it in pure Java; you need the TornadoVM runtime to execute this expression:

```java
// Example of a reduction using the GPU's local memory (OpenCL) / shared memory (CUDA)
public static void reductionLocal(float[] a, float[] b, int localSize, KernelContext context) {
    int globalIdx = context.globalIdx;
    int localIdx = context.localIdx;
    int localGroupSize = context.localGroupSizeX;
    int groupID = context.groupIdx; // Expose Group ID

    // Each thread copies one element into local (OpenCL) / shared (CUDA) memory
    float[] localA = context.allocateFloatLocalArray(256);
    localA[localIdx] = a[globalIdx];

    // Tree-reduction within the work-group, halving the stride each step
    for (int stride = (localGroupSize / 2); stride > 0; stride /= 2) {
        context.localBarrier(); // Barrier: all threads finish the previous step
        if (localIdx < stride) {
            localA[localIdx] += localA[localIdx + stride];
        }
    }

    // Store the result in the first position of the work-group
    if (localIdx == 0) {
        b[groupID] = localA[0];
    }
}
```
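For intuition, this is what the kernel computes per work-group, expressed as plain sequential Java (a sketch, assuming the input length is a multiple of `localSize`):

```java
// Sequential sketch of the per-work-group semantics (not TornadoVM code)
static void reductionLocalSequential(float[] a, float[] b, int localSize) {
    int numGroups = a.length / localSize; // assumes a.length % localSize == 0
    for (int group = 0; group < numGroups; group++) {
        float acc = 0.0f;
        for (int i = 0; i < localSize; i++) {
            acc += a[group * localSize + i]; // sum this group's chunk
        }
        b[group] = acc; // one partial result per work-group
    }
}
```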
How to invoke this code:

```java
// Assumed setup (sizes are placeholders): reduce holds one partial result per work-group
int size = 8192;
int localSize = 256;
float[] input = new float[size];
float[] reduce = new float[size / localSize];

WorkerGrid worker = new WorkerGrid1D(size);
GridScheduler gridScheduler = new GridScheduler();
gridScheduler.setWorkerGrid("s0.t0", worker);
KernelContext context = new KernelContext();

TaskGraph taskGraph = new TaskGraph("s0") //
        .transferToDevice(DataTransferMode.EVERY_EXECUTION, input, localSize) //
        .task("t0", ReductionsLocalMemory::reductionLocal, input, reduce, localSize, context) //
        .transferToHost(DataTransferMode.EVERY_EXECUTION, reduce);

// Change the Grid
worker.setLocalWork(localSize, 1, 1);

ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
executor.withGridScheduler(gridScheduler).execute();

// The reduce variable will have partial reductions: we need another
// reduction over all values stored in the output array
float sum = reduce[0];
for (int i = 1; i < reduce.length; i++) {
    sum += reduce[i];
}
```

We have plenty of examples of the Kernel Context API and the Loop Parallel annotations in our test-suite.
---
For non-expert GPU developers employing loop parallelization via annotations, is there a simple way to synchronize, e.g., to lock an object the default way it is done in Java, or simply to employ atomic fields?