Dispatching Commands

The dispatching commands described in this chapter provoke work in a compute pipeline. Dispatching commands are recorded into a command buffer and when executed by a queue, will produce work which executes according to the bound compute pipeline. A compute pipeline must be bound to a command buffer before any dispatching commands are recorded in that command buffer.

  • vkCmdDispatch: Dispatch compute work items
  • VkDispatchIndirect2InfoKHR: Dispatch indirect info
  • vkCmdDispatchIndirect2KHR: Dispatch compute work items with indirect parameters through an address range
  • vkCmdDispatchIndirect: Dispatch compute work items with indirect parameters
  • VkDispatchIndirectCommand: Structure specifying an indirect dispatching command
  • vkCmdDispatchBase: Dispatch compute work items with non-zero base values for the workgroup IDs
  • vkCmdDispatchTileQCOM: Dispatch per-tile work items
  • VkDispatchTileInfoQCOM: Structure specifying dispatch tile info
  • vkCmdSubpassShadingHUAWEI: Dispatch compute work items

Dispatch Command for CUDA PTX Kernels

Compute kernels can be provided in SPIR-V or PTX code. When using PTX kernels, the dispatch mechanism differs from that of regular compute pipelines.

Creating a PTX assembly file is beyond the scope of this documentation. For more information, please refer to the CUDA toolkit documentation at https://docs.nvidia.com/cuda/.

Prior to using this command, you must initialize a CUDA module, and create a function handle that will serve as the entry point of the kernel to dispatch. See CUDA Modules.

The dispatching of a CUDA kernel is recorded into a command buffer, and when executed by a queue submit, will produce work which executes according to the bound compute pipeline.

  • vkCmdCudaLaunchKernelNV: Dispatch compute work items

Passing Dispatch Parameters and Arguments

  • VkCudaLaunchInfoNV: Structure specifying the parameters to launch a CUDA kernel

Resource Sharing from Vulkan to the CUDA Kernel

One key limitation of this extension is that Vulkan can neither access nor bind any global resource of a CUDA module, so the only way to exchange data with the kernel is to pass resources via the arguments of the function.

You can use VK_KHR_buffer_device_address to read from and write to a VkBuffer object: it lets you query the buffer's device address and pass that address as an argument in pParams. Application-side pointer arithmetic on the device address is legal, but is not bounds-checked on the device.

The corresponding argument of the CUDA function should be declared as a pointer to the same type as the referenced buffer's contents. CUDA code may then simply read or write through this pointer in the usual C way.

You may also use VK_NVX_image_view_handle as another convenient way to read/write from/to a VkImage.

The corresponding argument of the CUDA function must be typed as cudaSurfaceObject_t.

  • You may read from it by using CUDA surface-read functions such as surf3Dread, surf2Dread, and surf1Dread
  • You may write to it by using CUDA surface-write functions such as surf3Dwrite, surf2Dwrite, and surf1Dwrite

Please refer to the CUDA surface object documentation for more details.

On the Vulkan side, here is an example of how to set up VkImageViewHandleInfoNVX to query the handles that will be passed as cudaSurfaceObject_t arguments:

VkImageViewHandleInfoNVX imageViewHandleInfo = {VK_STRUCTURE_TYPE_IMAGE_VIEW_HANDLE_INFO_NVX};
imageViewHandleInfo.sampler = VK_NULL_HANDLE;
imageViewHandleInfo.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;
imageViewHandleInfo.imageView = imageViewIn; // the VkImageView we want to read from
uint32_t myViewHandleIn = vkGetImageViewHandleNVX(m_device, &imageViewHandleInfo);
imageViewHandleInfo.imageView = imageViewOut; // the VkImageView we want to write to
uint32_t myViewHandleOut = vkGetImageViewHandleNVX(m_device, &imageViewHandleInfo);

Here is an example of how to declare the parameters referenced by pParams:

VkCudaLaunchInfoNV launchInfo = { VK_STRUCTURE_TYPE_CUDA_LAUNCH_INFO_NV };

int block_size = 8;
float dt = 1.0f / 60.0f;

const void* params[] =
{
  &dt,
  &myViewHandleIn,
  &myViewHandleOut
};

launchInfo.function = cudaFunction; // CUDA function previously created
launchInfo.gridDimX = (volumeTexDimensionNonBoundary / block_size);
launchInfo.gridDimY = (volumeTexDimensionNonBoundary / block_size);
launchInfo.gridDimZ = (volumeTexDimensionNonBoundary / block_size);
launchInfo.blockDimX = block_size;
launchInfo.blockDimY = block_size;
launchInfo.blockDimZ = block_size;
launchInfo.sharedMemBytes = 0;
launchInfo.paramCount = 3;
launchInfo.pParams = &params[0];
launchInfo.extraCount = 0;
launchInfo.pExtras = nullptr;

vkCmdCudaLaunchKernelNV(commandBuffer, &launchInfo);

In the CUDA kernel source code, here is an example of how the arguments match pParams and how the surface objects can be used:

extern "C"  __global__ void cudaFunction(
  float dt,
  cudaSurfaceObject_t volumeTexIn,
  cudaSurfaceObject_t volumeTexOut
  )
{
  int i = 1 + blockIdx.x * blockDim.x + threadIdx.x;
  int j = 1 + blockIdx.y * blockDim.y + threadIdx.y;
  int k = 1 + blockIdx.z * blockDim.z + threadIdx.z;

  float val;
  surf3Dread(&val, volumeTexIn, i * sizeof(float), j, k);
  ...
  float result = ...;
  // write result
  surf3Dwrite(result, volumeTexOut, i * sizeof(float), j, k);
}

Compute Occupancy Priority

The VK_NV_compute_occupancy_priority extension gives applications control over how their compute workloads occupy GPU compute resources relative to other simultaneously executing workloads. Applications specify the priority with which compute workloads should occupy those resources, allowing a fine-grained distinction between workloads that can execute at background priority over a long period of time and workloads with harder latency requirements.

  • vkCmdSetComputeOccupancyPriorityNV: Set the compute occupancy priority for subsequent compute dispatches
  • VkComputeOccupancyPriorityParametersNV: Structure specifying compute occupancy priority parameters
  • VK_COMPUTE_OCCUPANCY_PRIORITY_LOW_NV: Low occupancy priority constant
  • VK_COMPUTE_OCCUPANCY_PRIORITY_NORMAL_NV: Normal occupancy priority constant
  • VK_COMPUTE_OCCUPANCY_PRIORITY_HIGH_NV: High occupancy priority constant