Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h

In very broad terms: cuda.h defines the public host functions and types for the CUDA driver API. cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API. cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions. If you …
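As a minimal sketch of the distinction (assuming a .cu file compiled with nvcc, so both APIs are available): a driver-API program includes cuda.h and uses cu*-prefixed calls with explicit initialization, while a runtime-API program includes cuda_runtime.h and uses cuda*-prefixed calls that initialize implicitly.

```cuda
#include <cuda.h>          // driver API: cu* functions, CUdevice, CUcontext, ...
#include <cuda_runtime.h>  // runtime API plus language-extension support
#include <cstdio>

int main()
{
    // Driver API style: explicit initialization required before any cu* call.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Runtime API style: initialized implicitly on the first cuda* call.
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("devices: %d\n", count);
    return 0;
}
```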

Read more

Default Pinned Memory Vs Zero-Copy Memory

I think it depends on your application (otherwise, why would they provide both ways?). Mapped, pinned memory (zero-copy) is useful when either: the GPU has no memory of its own and uses system RAM anyway, or you load the data exactly once, but you have a lot of computation to perform on it and you want to …
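A minimal sketch of the zero-copy setup (buffer size and variable names are illustrative): the host allocation is pinned and mapped, and the kernel accesses it through a device-side alias instead of a separate cudaMemcpy.

```cuda
#include <cuda_runtime.h>

int main()
{
    // Allow the device to map host allocations into its address space;
    // must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned + mapped ("zero-copy") host allocation.
    float *h_data = nullptr;
    cudaHostAlloc(&h_data, 1024 * sizeof(float), cudaHostAllocMapped);

    // Device pointer aliasing the same physical memory: kernels that
    // dereference it read/write host RAM over the bus on each access.
    float *d_data = nullptr;
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    // ... launch kernels that take d_data as an argument ...

    cudaFreeHost(h_data);
    return 0;
}
```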

Read more

CUDA and Classes

Define the class in a header that you #include, just like in C++. Any method that must be called from device code should be defined with both __device__ and __host__ declspecs, including the constructor and destructor if you plan to use new/delete on the device (note new/delete require CUDA 4.0 and a compute capability 2.0 …
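A short sketch of the pattern (class and kernel names are made up for illustration): every member that device code touches, including the constructor and destructor, carries both qualifiers.

```cuda
#include <cuda_runtime.h>

// Usable from both host and device code because each method that device
// code calls is declared __host__ __device__.
class Vec2 {
public:
    __host__ __device__ Vec2(float x, float y) : x_(x), y_(y) {}
    __host__ __device__ ~Vec2() {}
    __host__ __device__ float dot(const Vec2 &o) const {
        return x_ * o.x_ + y_ * o.y_;
    }
private:
    float x_, y_;
};

__global__ void dotKernel(float *out)
{
    // Constructed directly in device code; with CUDA >= 4.0 and compute
    // capability >= 2.0 you could also heap-allocate with new/delete here.
    Vec2 a(1.0f, 2.0f), b(3.0f, 4.0f);
    out[threadIdx.x] = a.dot(b);
}
```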

Read more

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source –> PTX –> SASS The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is …
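To make the two stages concrete, here is a hypothetical kernel.cu with example nvcc invocations in the comments (the flags are standard nvcc switches; the specific compute_/sm_ numbers are just examples):

```cuda
// nvcc -arch=compute_20 -code=sm_21 kernel.cu
//   PTX is generated for the virtual architecture compute_20,
//   then SASS is generated from that PTX for the real architecture sm_21.
//
// nvcc -gencode arch=compute_20,code=sm_20 \
//      -gencode arch=compute_20,code=sm_21 kernel.cu
//   One PTX target, two SASS targets, all embedded in the same fat binary;
//   the runtime picks the best match for the GPU it finds.

__global__ void kernel() {}
```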

Read more

Can/Should I run this code of a statistical application on a GPU?

UPDATE GPU Version

```cuda
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;
    while (x < size) // While there is work …
```

Read more