How does CUDA structure computation?
CUDA broadly follows the data-parallel model of computation. Typically each thread executes the same operation on different elements of the data in parallel. The data is split up into a 1D or 2D grid of blocks. Each block can be 1D, 2D or 3D in shape, and can consist of up to 512 threads on current hardware. Threads within a thread block can coooperate via the shared memory. Thread blocks are executed as smaller groups of threads known as “warps”. The warp size is 32 threads on current hardware.