Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h

In very broad terms: cuda.h defines the public host functions and types for the CUDA driver API. cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API. cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions. If you …
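As a rough illustration of the practical split (a minimal sketch, not from the original answer; error checking omitted), the same allocation looks like this through each API:

```cpp
#include <cuda.h>          // driver API: cu* functions, link against libcuda
#include <cuda_runtime.h>  // runtime API: cuda* functions, link against libcudart

int main() {
    // Driver API: explicit initialization and context management.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr dptr;
    cuMemAlloc(&dptr, 1024);
    cuMemFree(dptr);
    cuCtxDestroy(ctx);

    // Runtime API: initialization and context handling are implicit.
    void* p = nullptr;
    cudaMalloc(&p, 1024);
    cudaFree(p);
    return 0;
}
```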

Read more

Default Pinned Memory Vs Zero-Copy Memory

I think it depends on your application (otherwise, why would they provide both ways?). Mapped, pinned memory (zero-copy) is useful when either: the GPU has no memory of its own and uses system RAM anyway, or you load the data exactly once, but you have a lot of computation to perform on it and you want to …
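A minimal sketch of the two allocation styles side by side (the calls are the standard runtime API; the sizes and flow are illustrative only):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // Required on some (older) setups before mapped allocations are usable.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Default pinned (page-locked) memory: fast cudaMemcpy, but the data
    // still has to be copied into device memory before kernels consume it.
    float* pinned = nullptr;
    cudaMallocHost((void**)&pinned, bytes);

    float* dev = nullptr;
    cudaMalloc((void**)&dev, bytes);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);

    // Zero-copy (mapped, pinned) memory: kernels read host memory directly
    // over the bus through a device-side alias; no explicit copy.
    float* mapped = nullptr;
    cudaHostAlloc((void**)&mapped, bytes, cudaHostAllocMapped);

    float* devView = nullptr;
    cudaHostGetDevicePointer((void**)&devView, mapped, 0);
    // devView can now be passed to a kernel like any device pointer.

    cudaFree(dev);
    cudaFreeHost(pinned);
    cudaFreeHost(mapped);
    return 0;
}
```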

Read more

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source -> PTX -> SASS. The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is …
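A sketch of how those two stages map onto nvcc switches, reusing the legacy compute_20/sm_21 values from the excerpt (the right numbers depend on your toolkit and target GPUs):

```sh
# -gencode pairs a virtual architecture (PTX) with a real architecture (SASS).
nvcc -gencode arch=compute_20,code=sm_21 kernel.cu

# Multiple clauses embed SASS for several real architectures, plus the PTX
# itself (code=compute_20) so future GPUs can JIT-compile it at load time.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_20,code=sm_21 \
     -gencode arch=compute_20,code=compute_20 kernel.cu
```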

Read more

Thrust inside user written kernels

As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

```cpp
thrust::device_vector<Foo> fooVector;
// Do something thrust-y with fooVector

Foo* fooArray = thrust::raw_pointer_cast(fooVector.data());
// Pass raw array and …
```
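Filling in the launch around that snippet (a hedged sketch: the Foo payload and fooKernel below are hypothetical stand-ins, not from the original answer):

```cpp
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

// Hypothetical payload type standing in for "Foo".
struct Foo { float x; };

// Hypothetical kernel consuming the raw device pointer.
__global__ void fooKernel(Foo* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i].x *= 2.0f;  // some per-element work
}

int main() {
    thrust::device_vector<Foo> fooVector(1024, Foo{1.0f});

    // Strip the Thrust wrapper so an ordinary kernel can take the pointer.
    Foo* fooArray = thrust::raw_pointer_cast(fooVector.data());

    fooKernel<<<(1024 + 255) / 256, 256>>>(fooArray, fooVector.size());
    cudaDeviceSynchronize();
    return 0;
}
```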

Read more

CUDA: How to use -arch and -code and SM vs COMPUTE

Some related questions/answers are here and here. I am still not sure how to properly specify the architectures for code generation when building with nvcc. A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you …
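One such canonical usage, sketched with the legacy sm_21 example from these answers (the expansion shown is my reading of the nvcc documentation, so treat it as indicative rather than authoritative):

```sh
# Shorthand: compile for the virtual architecture compute_20 and the real
# architecture sm_21, embedding the PTX as well for forward compatibility.
nvcc -arch=sm_21 app.cu

# Roughly equivalent explicit spelling:
nvcc -gencode arch=compute_20,code=[compute_20,sm_21] app.cu
```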

Read more

CUDA model – what is warp size?

Direct Answer: Warp size is the number of threads in a warp, which is a sub-division used in the hardware implementation to coalesce memory access and instruction dispatch. Suggested Reading: As @Matias mentioned, I’d go read the CUDA C Best Practices Guide (you’ll have to scroll to the bottom where it’s listed). It might help …
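As a small supplement (a minimal sketch, not part of the quoted answer), the warp size can be queried at runtime instead of hard-coding 32:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // warpSize has been 32 on every CUDA GPU to date, but querying it
    // keeps the assumption explicit.
    printf("Warp size: %d threads\n", prop.warpSize);
    return 0;
}
```

Inside device code, the built-in variable warpSize exposes the same value.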

Read more