The documentation is a bit sparse but readable. You simply define the function you want to execute and the dimensions of the problem; you can specify one, two, or three dimensions, whichever suits your problem space. When you call the associated function, it tries to run the kernels on your GPU in parallel. If it can't, you still get the right answer, just more slowly.
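The shape of that pattern can be sketched in plain Python. This is not the library's actual API; `run_kernel` and `square_sum` are hypothetical names, and a process pool stands in for the GPU. The point is the contract: you hand over a function and the grid dimensions, the runner tries to execute the work items in parallel, and if that fails it falls back to a serial loop that produces the same answer.

```python
# Illustrative sketch only: run_kernel is a made-up helper, not the
# library's real API. A process pool stands in for the GPU here.
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def run_kernel(kernel, dims):
    """Apply `kernel` to every index tuple in a 1-, 2-, or 3-D grid."""
    indices = list(product(*(range(n) for n in dims)))
    try:
        # Try the parallel path first.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(kernel, indices))
    except Exception:
        # Fallback: same answer, computed serially, just more slowly.
        results = [kernel(idx) for idx in indices]
    return dict(zip(indices, results))


def square_sum(idx):
    # Example "kernel": works for any number of dimensions.
    return sum(i * i for i in idx)


if __name__ == "__main__":
    out = run_kernel(square_sum, (2, 3))  # a 2x3 grid of work items
    print(out[(1, 2)])  # square_sum((1, 2)) = 1 + 4 = 5
```

Each index tuple plays the role of a thread ID, so the same kernel function covers a line, a plane, or a volume of work items depending on the dimensions you pass.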
There was a time when buying a vector processor was a major purchase. Now there is one in your PC, drawing your web pages. If you want a different approach, you could always build your own cluster. You can even have a stack of Pis.