c++ - How do you iterate through a pitched CUDA array?
Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem very intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
CUDA by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that work into threads for you, in CUDA you have to tell it (via the number of blocks and the number of threads in <<<...>>>) to run it sufficient times to iterate through your array, using the thread id number as an iterator. In other words, you can have the CUDA kernel run 10,000 times, which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
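Since that kernel uses blockIdx.x as the index, a launch that covers an array of N = 10,000 elements would use one block per element, something like the sketch below (dev_a, dev_b and dev_c are assumed to be device pointers allocated elsewhere):

#define N 10000
// one block per element, one thread per block; tid = blockIdx.x runs from 0 to N-1
add<<<N, 1>>>(dev_a, dev_b, dev_c);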
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
This example doesn't seem useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).
According to this answer, the iterator when there are blocks of threads going on would be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I would just have to make sure I had enough blocks and threads to cover every element, including the padding that I don't care about. That seems wasteful.
So how do you iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (complex floating point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
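For example, the single-precision definitions might look like this (just a sketch; the double-precision versions would presumably use cufftDoubleComplex and double):

#define CFPtype cufftComplex   // complex floating point type
#define FPtype  float          // real floating point type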
I still can't quite wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but that is for a separate question.
The key to getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives plenty of "apparent" help for allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
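To make the bytes-versus-elements point concrete, here is a sketch of the mistake versus the fix:

// wrong: width here is in elements, but cudaMallocPitch expects the row width in bytes
cudaMallocPitch(&inputGPU, &complexPitch, width, height);

// right: multiply by sizeof(CFPtype) so the requested row width is in bytes
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);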
So given an image of dimensions width x height in pixels, this is how it all comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That is allocated with cudaHostAlloc and is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case all the dimensions are the same (complex-to-complex transform), but I have arrays of real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
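For reference, the pinned host buffer mentioned above can be allocated with a single cudaHostAlloc call; a sketch (inputPinned is the name used in the copy below):

CFPtype *inputPinned;
cudaHostAlloc(&inputPinned, width * height * sizeof(CFPtype), cudaHostAllocDefault);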
To copy memory to/from pitched arrays you cannot use cudaMemcpy.
cudaMemcpy2D(inputGPU, complexPitch,               //destination and destination pitch
             inputPinned, width * sizeof(CFPtype), //source and source pitch (= width because it is not padded)
             width * sizeof(CFPtype), height,
             cudaMemcpyKind::cudaMemcpyHostToDevice);
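Copying the results back works the same way in the other direction; a sketch, assuming a pinned host buffer magnitudePinned and a device array magnitudeGPU allocated with realPitch:

cudaMemcpy2D(magnitudePinned, width * sizeof(FPtype), //destination and destination pitch (unpadded)
             magnitudeGPU, realPitch,                 //source and source pitch
             width * sizeof(FPtype), height,
             cudaMemcpyKind::cudaMemcpyDeviceToHost);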
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, complexPitch / sizeof(CFPtype)};
result = cufftPlanMany(
    &plan,
    2, n,        //transform rank and dimensions
    nembed, 1,   //input array physical dimensions and stride
    1,           //input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1,   //output array physical dimensions and stride
    1,           //output distance to next batch
    cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
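As a side note, both cuFFT calls return a cufftResult that can be checked, in the same way result is captured from cufftPlanMany above; a sketch:

if (cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD) != CUFFT_SUCCESS) {
    // handle the error here
}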
So far I have had little to optimize. Now I wanted to get the magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" number of threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height,
                     FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width  / (double)threadsPerBlock.x),
                   (unsigned int)ceil(height / (double)threadsPerBlock.y));

    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height,
                                                       magnitude, phase, magPhasePitch);
}
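A sketch of how that wrapper might be called, assuming the device pointers and pitches from the allocations above and a hypothetical block size of 16 (i.e. 16 x 16 = 256 threads per block):

// magnitudeGPU and phaseGPU are assumed to come from cudaMallocPitch calls using realPitch,
// analogous to the complex allocation shown earlier
GPUCalcMagPhase(outputGPU, complexPitch, width, height,
                magnitudeGPU, phaseGPU, realPitch, 16);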
Setting up the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
Traversing the pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.