Dispatching Commands

The dispatching commands described in this chapter provoke work in a compute pipeline. Dispatching commands are recorded into a command buffer and when executed by a queue, will produce work which executes according to the bound compute pipeline. A compute pipeline must be bound to a command buffer before any dispatching commands are recorded in that command buffer.

  • vkCmdDispatch: Dispatch compute work items
  • VkDispatchIndirect2InfoKHR: Dispatch indirect info
  • vkCmdDispatchIndirect2KHR: Dispatch compute work items with indirect parameters through an address range
  • vkCmdDispatchIndirect: Dispatch compute work items with indirect parameters
  • VkDispatchIndirectCommand: Structure specifying an indirect dispatching command
  • vkCmdDispatchBase: Dispatch compute work items with non-zero base values for the workgroup IDs
  • vkCmdDispatchTileQCOM: Dispatch per-tile work items
  • VkDispatchTileInfoQCOM: Structure specifying dispatch tile info
  • vkCmdSubpassShadingHUAWEI: Dispatch compute work items

Dispatch Command for CUDA PTX Kernels

Compute kernels can be provided in SPIR-V or PTX code. When using PTX kernels, the dispatch mechanism differs from that of regular compute pipelines.

Creating a PTX assembly file is beyond the scope of this documentation. For more information, please refer to the CUDA toolkit documentation at https://docs.nvidia.com/cuda/.

Prior to using this command, you must initialize a CUDA module, and create a function handle that will serve as the entry point of the kernel to dispatch. See CUDA Modules.

The dispatching of a CUDA kernel is recorded into a command buffer, and when executed by a queue submit, will produce work which executes according to the bound compute pipeline.

  • vkCmdCudaLaunchKernelNV: Dispatch compute work items

Passing Dispatch Parameters and Arguments

  • VkCudaLaunchInfoNV: Structure specifying the parameters to launch a CUDA kernel

Resource Sharing from Vulkan to the CUDA Kernel

One key limitation of this extension is that Vulkan can neither access nor bind any global resource of a CUDA module, so the only way to exchange data with the kernel is to pass resources via the arguments of the function.

You can use VK_KHR_buffer_device_address to read from and write to a VkBuffer object: it lets you query the buffer's device address and pass that address as an argument in pParams. Application-side pointer arithmetic on the device address is legal, but is not bounds-checked on the device.

The corresponding argument of the CUDA function should be declared as a pointer to the same type as the referenced buffer's contents. CUDA code may then simply read or write through this pointer in the usual C way.

You may also use VK_NVX_image_view_handle as another convenient way to read/write from/to a VkImage.

The corresponding argument of the CUDA function must be typed as cudaSurfaceObject_t.

  • You may read from it by using CUDA surface-read functions such as surf3Dread, surf2Dread, and surf1Dread
  • You may write to it by using CUDA surface-write functions such as surf3Dwrite, surf2Dwrite, and surf1Dwrite

Please refer to the CUDA surface object documentation for more details.

On the Vulkan side, here is an example of how to set up VkImageViewHandleInfoNVX to query the handles that will be passed as cudaSurfaceObject_t arguments:

VkImageViewHandleInfoNVX imageViewHandleInfo = {VK_STRUCTURE_TYPE_IMAGE_VIEW_HANDLE_INFO_NVX};
imageViewHandleInfo.sampler = VK_NULL_HANDLE;
imageViewHandleInfo.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;
imageViewHandleInfo.imageView = imageViewIn; // the VkImageView we want to read from
uint32_t myViewHandleIn = vkGetImageViewHandleNVX(m_device, &imageViewHandleInfo);
imageViewHandleInfo.imageView = imageViewOut; // the VkImageView we want to write to
uint32_t myViewHandleOut = vkGetImageViewHandleNVX(m_device, &imageViewHandleInfo);

Here is an example of how to declare the parameters referenced by pParams:

VkCudaLaunchInfoNV launchInfo = { VK_STRUCTURE_TYPE_CUDA_LAUNCH_INFO_NV };

int block_size = 8;
float dt = 1.0f / 60.0f;

const void* params[] =
{
  &dt,
  &myViewHandleIn,
  &myViewHandleOut
};

launchInfo.function = cudaFunction; // CUDA function previously created
launchInfo.gridDimX = (volumeTexDimensionNonBoundary / block_size);
launchInfo.gridDimY = (volumeTexDimensionNonBoundary / block_size);
launchInfo.gridDimZ = (volumeTexDimensionNonBoundary / block_size);
launchInfo.blockDimX = block_size;
launchInfo.blockDimY = block_size;
launchInfo.blockDimZ = block_size;
launchInfo.sharedMemBytes = 0;
launchInfo.paramCount = 3;
launchInfo.pParams = &params[0];
launchInfo.extraCount = 0;
launchInfo.pExtras = nullptr;

vkCmdCudaLaunchKernelNV(commandBuffer, &launchInfo);

In the CUDA kernel source code, here is an example of how the arguments match pParams and how the surface objects can be used:

extern "C"  __global__ void cudaFunction(
  float dt,
  cudaSurfaceObject_t volumeTexIn,
  cudaSurfaceObject_t volumeTexOut
  )
{
  int i = 1 + blockIdx.x * blockDim.x + threadIdx.x;
  int j = 1 + blockIdx.y * blockDim.y + threadIdx.y;
  int k = 1 + blockIdx.z * blockDim.z + threadIdx.z;

  float val;
  surf3Dread(&val, volumeTexIn, i * sizeof(float), j, k);
  ...
  float result = ...;
  // write result
  surf3Dwrite(result, volumeTexOut, i * sizeof(float), j, k);
}

Compute Occupancy Priority

The VK_NV_compute_occupancy_priority extension gives applications control over how their compute workloads occupy GPU compute resources relative to other simultaneously executing workloads. Applications specify the priority with which compute workloads should occupy those resources, allowing a fine-grained distinction between workloads that can execute at background priority over a long period of time and workloads with harder latency requirements.

  • vkCmdSetComputeOccupancyPriorityNV: Set the compute occupancy priority for subsequent compute dispatches
  • VkComputeOccupancyPriorityParametersNV: Structure specifying compute occupancy priority parameters
  • VK_COMPUTE_OCCUPANCY_PRIORITY_LOW_NV: Low occupancy priority constant
  • VK_COMPUTE_OCCUPANCY_PRIORITY_NORMAL_NV: Normal occupancy priority constant
  • VK_COMPUTE_OCCUPANCY_PRIORITY_HIGH_NV: High occupancy priority constant