c++ - How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.

CUDA by Example was a great start.

The snippet on page 43 shows:

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and the number of threads in <<<...>>>) to run it enough times to iterate through the array, using the thread ID number as the iterator. In other words, you can have the CUDA kernel run 10,000 times, which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
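As a concrete sketch of how that launch works (this is my own minimal host-side example, not taken from the book; the array contents and size are assumptions), the kernel above could be driven like this:

#include <cuda_runtime.h>

#define N 10000

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;            // one block per element, one thread per block
    if (tid < N)                     // threads past the end of the array drop out here
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }   // some test data

    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);   // N blocks of 1 thread each: tid runs 0..N-1

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}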

For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

This example doesn't seem very useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).

According to this answer, the iterator when there are blocks of threads going on will be

int tid = (blockIdx.x * blockDim.x) + threadIdx.x;

So if I wanted to run the first snippet in my question for a pitched array, I would just have to make sure I had enough blocks and threads to cover every element, including the padding that I don't care about. That seems wasteful.
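For instance (a sketch with an assumed block size of 256, and assuming dev_a, dev_b and dev_c are the device arrays from the sketch above), covering a flat N-element array with that index looks like this:

#define N 10000

__global__ void add(int *a, int *b, int *c) {
    int tid = (blockIdx.x * blockDim.x) + threadIdx.x;   // global thread index
    if (tid < N)                                         // guard against the excess threads
        c[tid] = a[tid] + b[tid];
}

// host side: launch enough blocks to cover all N elements
int threadsPerBlock = 256;                                  // assumed, not a tuned value
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);

For a pitched 2D array, the same kind of launch would also have to span the padding columns, which is the waste referred to above.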

So how do I iterate through a pitched array without going through the padding elements?

In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).

After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.

In the code below, I define CFPtype (complex floating point) and FPtype so that I can change between single and double precision. For example, #define CFPtype cufftComplex.
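Spelled out, the precision switch might look something like this (a sketch; only the CFPtype/cufftComplex pairing appears in my code, the rest is the obvious double-precision counterpart):

#include <cufft.h>

// single precision
#define CFPtype cufftComplex
#define FPtype float

// double precision alternative:
// #define CFPtype cufftDoubleComplex
// #define FPtype double
// (the plan type CUFFT_C2C would then also become CUFFT_Z2Z)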

I still can't quite wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but that's for a separate question.

The key to getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
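In other words, the trap is passing an element count where the API expects a byte count; a small illustration of the mistake (using the same variable names that appear below):

// Wrong: width is a number of elements, but cudaMallocPitch wants the row width in bytes
// cudaMallocPitch(&inputGPU, &complexPitch, width, height);

// Right: multiply by the element size
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);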

So given an image of dimensions width x height in pixels, this is how it all comes together:

Allocating memory

I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc and is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it can change. In my case the dimensions are all the same (complex-to-complex transform), but I also have arrays of real numbers, so I store both complexPitch and realPitch. The pitched memory is done like this:

cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
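The pinned host allocation mentioned above isn't shown in my snippets; a minimal sketch of it (inputPinned is the same buffer used in the copy below, the flag choice is my own assumption) would be:

CFPtype *inputPinned;
cudaHostAlloc((void **)&inputPinned, width * height * sizeof(CFPtype), cudaHostAllocDefault);
// ... fill inputPinned with the image data ...
// when done: cudaFreeHost(inputPinned);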

To copy memory to/from the pitched arrays you cannot use cudaMemcpy.

cudaMemcpy2D(inputGPU, complexPitch,        //destination and destination pitch
    inputPinned, width * sizeof(CFPtype),   //source and source pitch (= width because it's not padded)
    width * sizeof(CFPtype), height,
    cudaMemcpyKind::cudaMemcpyHostToDevice);
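Copying data back off the device works the same way with the roles reversed; a sketch, assuming an un-padded pinned host buffer outputPinned of the same dimensions:

cudaMemcpy2D(outputPinned, width * sizeof(CFPtype),   //destination and destination pitch (no padding on the host)
    outputGPU, complexPitch,                          //source and source pitch
    width * sizeof(CFPtype), height,
    cudaMemcpyKind::cudaMemcpyDeviceToHost);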

FFT plan for pitched arrays

JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:

int n[] = {height, width};
int nembed[] = {height, complexPitch/sizeof(CFPtype)};
result = cufftPlanMany(
    &plan,
    2, n,        //transform rank and dimensions
    nembed, 1,   //input array physical dimensions and stride
    1,           //input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1,   //output array physical dimensions and stride
    1,           //output distance to next batch
    cufftType::CUFFT_C2C,
    1);

Executing the FFT is trivial:

cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);

So far I have had little to optimize. Now I wanted the magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.

void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));

    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
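A hypothetical call from the host then looks like this (a sketch; the allocations, the 16 x 16 block size, and any names other than outputGPU, complexPitch, realPitch, width and height are assumptions on my part):

// magnitude and phase are real arrays of the same row width, so they share a pitch (realPitch)
FPtype *magnitudeGPU, *phaseGPU;
size_t realPitch;
cudaMallocPitch((void **)&magnitudeGPU, &realPitch, width * sizeof(FPtype), height);
cudaMallocPitch((void **)&phaseGPU, &realPitch, width * sizeof(FPtype), height);

// 16 x 16 = 256 threads per block; an assumed choice, not a tuned value
GPUCalcMagPhase(outputGPU, complexPitch, width, height, magnitudeGPU, phaseGPU, realPitch, 16);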

Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.

Traversing a pitched array in parallel

The kernel uses the standard pointer arithmetic from the documentation:

__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}

The only wasted threads are for the cases where the width or height are not multiples of the number of threads per block.

Tags: c++, arrays, memory, cuda
