Device-Generated Commands

This chapter discusses the generation of command buffer content on the device, for which these principal steps are to be taken:

Define a layout describing the sequence of commands which should be generated.
Optionally set up device-bindable shaders.
Retrieve device addresses by vkGetBufferDeviceAddressEXT for setting buffers on the device.
Fill one or more VkBuffer with the appropriate content that gets interpreted by the command layout.
Create a preprocess VkBuffer using the device-queried allocation information.
Optionally preprocess the input data in a separate action.
Generate and execute the actual commands.

The preprocessing step executes in a separate logical pipeline from either graphics or compute. When preprocessing commands in a separate step, they must be explicitly synchronized against the command execution. When not preprocessing in a separate step, the preprocessing is automatically synchronized against the command execution.

Indirect Commands Layout

The device-side command generation happens through an iterative processing of an atomic sequence comprised of command tokens, which are represented by:

or:

Each indirect command layout must have exactly one action command token and it must be the last token in the sequence.
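
The rule above can be checked mechanically. The following is a minimal sketch, assuming a simplified two-way classification of tokens; `TokenKind` and `layout_tokens_are_valid` are hypothetical names introduced for illustration and are not part of the API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token categories, for illustration only. The real API
 * defines many token types; here each is reduced to state or action. */
typedef enum { TOKEN_STATE, TOKEN_ACTION } TokenKind;

/* Returns true when the sequence contains exactly one action token and
 * that token is the final entry. */
static bool layout_tokens_are_valid(const TokenKind *tokens, size_t count)
{
    if (count == 0)
        return false;
    for (size_t i = 0; i + 1 < count; i++) {
        if (tokens[i] == TOKEN_ACTION)
            return false; /* an action token appears before the end */
    }
    return tokens[count - 1] == TOKEN_ACTION;
}
```

A sequence of state tokens followed by one draw or dispatch token passes this check; a sequence with an action token in any earlier position does not.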

If the indirect commands layout contains only 1 token, it will be an action command token, and the contents of the indirect buffer will be a sequence of indirect command structures, similar to the ones used for indirect draws and dispatches. On some implementations, using indirect draws and dispatches for these cases will result in increased performance compared to using device-generated commands, due to the overhead that results from using the latter.
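
As an illustration of the single-token case, the sketch below fills a host-side byte array with one draw command per sequence. `DrawCommand` is a local mirror of the four `uint32_t` fields of VkDrawIndirectCommand so that the example compiles without Vulkan headers, and `write_draw_sequences` is a hypothetical helper, not an API function.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the field layout of VkDrawIndirectCommand. */
typedef struct {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t firstVertex;
    uint32_t firstInstance;
} DrawCommand;

/* Writes `count` draw commands into `dst`, one per sequence, placed
 * `stride` bytes apart. `stride` is assumed to be at least
 * sizeof(DrawCommand). Returns the number of bytes spanned. */
static size_t write_draw_sequences(uint8_t *dst, size_t stride,
                                   uint32_t count, uint32_t verticesPerDraw)
{
    for (uint32_t s = 0; s < count; s++) {
        DrawCommand cmd = { verticesPerDraw, 1, s * verticesPerDraw, 0 };
        memcpy(dst + (size_t)s * stride, &cmd, sizeof cmd);
    }
    return (size_t)count * stride;
}
```

The resulting memory has the same shape as the buffer consumed by a regular indirect draw with the same stride, which is why the two approaches are interchangeable for this case.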

Creation and Deletion

Token Input Streams

For VK_EXT_device_generated_commands, the input streams can contain raw uint32_t values, existing indirect commands such as:

or additional commands as listed below. How the data is used is described in the next section.

For VK_NV_device_generated_commands, the input streams can contain raw uint32_t values, existing indirect commands such as:

or additional commands as listed below. How the data is used is described in the next section.

Tokenized Command Processing

The processing for VK_EXT_device_generated_commands is in principle illustrated below:

void cmdProcessSequence(cmd, indirectExecutionSet, indirectCommandsLayout, indirectAddress, s)
{
  for (t = 0; t < indirectCommandsLayout.tokenCount; t++)
  {
    uint32_t offset  = indirectCommandsLayout.pTokens[t].offset;
    uint32_t stride  = indirectCommandsLayout.indirectStride;
    VkDeviceAddress streamData = indirectAddress;
    const void* input = streamData + stride * s + offset;

    // further details later
    indirectCommandsLayout.pTokens[t].command (cmd, indirectExecutionSet, input, s);
  }
}

void cmdProcessAllSequences(cmd, indirectExecutionSet, indirectCommandsLayout, indirectAddress, sequencesCount)
{
  for (s = 0; s < sequencesCount; s++)
  {
    sUsed = s;

    if (indirectCommandsLayout.flags & VK_INDIRECT_COMMANDS_LAYOUT_USAGE_UNORDERED_SEQUENCES_BIT_EXT) {
      sUsed = incoherent_implementation_dependent_permutation[ sUsed ];
    }

    cmdProcessSequence( cmd, indirectExecutionSet, indirectCommandsLayout, indirectAddress, sUsed );
  }
}

The processing of each sequence is considered stateless, therefore all state changes must occur prior to action commands within the sequence. A single sequence is strictly targeting the VkShaderStageFlags it was created with.

The primary input data for each token is provided through VkBuffer content, either at preprocessing time using vkCmdPreprocessGeneratedCommandsEXT or at execution time using vkCmdExecuteGeneratedCommandsEXT. Some functional arguments, for example push constant layouts, are instead specified at layout creation time. The input size is different for each token.
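
To make the per-token addressing concrete, here is a minimal sketch of one possible per-sequence record and the address arithmetic applied to it. `SequenceRecord`, `DrawArgs`, the enum constants, and `token_input_address` are hypothetical names chosen for illustration; the record layout is an assumption, not something the API prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the four uint32_t fields of VkDrawIndirectCommand. */
typedef struct {
    uint32_t vertexCount, instanceCount, firstVertex, firstInstance;
} DrawArgs;

/* Hypothetical per-sequence record: push constant data, then a draw. */
typedef struct {
    uint32_t pushData[4]; /* read by a push constant token */
    DrawArgs draw;        /* read by the draw token        */
} SequenceRecord;

/* Token offsets and stride as they would be given at layout creation. */
enum {
    PUSH_TOKEN_OFFSET = offsetof(SequenceRecord, pushData),
    DRAW_TOKEN_OFFSET = offsetof(SequenceRecord, draw),
    INDIRECT_STRIDE   = sizeof(SequenceRecord)
};

/* Mirrors `streamData + stride * s + offset` from the pseudocode. */
static uint64_t token_input_address(uint64_t indirectAddress,
                                    uint32_t stride, uint32_t s,
                                    uint32_t offset)
{
    return indirectAddress + (uint64_t)stride * s + offset;
}
```

Deriving the offsets with `offsetof` keeps the layout description and the data written into the buffer in agreement if the record is later rearranged.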

void cmdProcessSequence(cmd, indirectExecutionSet, indirectCommandsLayout, indirectAddress, s)
{
  for (uint32_t t = 0; t < indirectCommandsLayout.tokenCount; t++) {
    VkIndirectCommandsLayoutTokenEXT *token = &indirectCommandsLayout.pTokens[t];

    uint32_t offset  = token->offset;
    uint32_t stride  = indirectCommandsLayout.indirectStride;
    VkDeviceAddress streamData = indirectAddress;
    const void* input = streamData + stride * s + offset;

    switch (token->tokenType) {
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_EXECUTION_SET_EXT:
      uint32_t *bind = input;
      VkIndirectCommandsExecutionSetTokenEXT *info = token->data.pExecutionSet;

      if (info->type == VK_INDIRECT_EXECUTION_SET_INFO_TYPE_PIPELINES_EXT) {
        vkCmdBindPipeline(cmd, indirectExecutionSet.pipelineBindPoint, indirectExecutionSet.pipelines[*bind]);
      } else {
        VkShaderStageFlagBits stages[];
        VkShaderEXT shaders[];
        uint32_t i = 0;
        IterateBitmaskLSBToMSB(iter, info->shaderStages) {
            stages[i] = iter;
            shaders[i] = indirectExecutionSet.shaders[bind[i]].shaderObject;
            i++;
        }
        vkCmdBindShadersEXT(cmd, i, stages, shaders);
      }
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT:
      uint32_t* data = input;
      VkPushConstantsInfoKHR info = {
        VK_STRUCTURE_TYPE_PUSH_CONSTANTS_INFO_KHR,
        NULL, // pNext
        // this can also use `dynamicGeneratedPipelineLayout' to pass a VkPipelineLayoutCreateInfo from pNext
        indirectCommandsLayout.pipelineLayout,
        token->data.pPushConstant->updateRange.stageFlags,
        token->data.pPushConstant->updateRange.offset,
        token->data.pPushConstant->updateRange.size,
        data
      };

      vkCmdPushConstants2KHR(cmd, &info);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_SEQUENCE_INDEX_EXT:
      VkPushConstantsInfoKHR info = {
        VK_STRUCTURE_TYPE_PUSH_CONSTANTS_INFO_KHR,
        NULL, // pNext
        // this can also use `dynamicGeneratedPipelineLayout' to pass a VkPipelineLayoutCreateInfo from pNext
        indirectCommandsLayout.pipelineLayout,
        token->data.pPushConstant->updateRange.stageFlags,
        token->data.pPushConstant->updateRange.offset,
        // this must be 4
        token->data.pPushConstant->updateRange.size,
        // this just updates the sequence index
        &s
      };

      vkCmdPushConstants2KHR(cmd, &info);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_INDEX_BUFFER_EXT:
      VkBindIndexBufferIndirectCommandEXT* data = input;

      vkCmdBindIndexBuffer(cmd, deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), data->indexType);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_EXT:
      VkBindVertexBufferIndirectCommandEXT* data = input;

      vkCmdBindVertexBuffers2(cmd, token->data.pVertexBuffer->vertexBindingUnit, 1, &deriveBuffer(data->bufferAddress),
                              &deriveOffset(data->bufferAddress), &data->size, &data->stride);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_EXT:
      VkDrawIndexedIndirectCommand *data = input;

      vkCmdDrawIndexed(cmd, data->indexCount, data->instanceCount, data->firstIndex, data->vertexOffset, data->firstInstance);
      break;
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawIndexedIndirect(cmd, deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), min(data->commandCount, indirectCommandsLayout.maxDrawCount), data->stride);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_EXT:
      VkDrawIndirectCommand* data = input;

      vkCmdDraw(cmd, data->vertexCount, data->instanceCount, data->firstVertex, data->firstInstance);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawIndirect(cmd, deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), min(data->commandCount, indirectCommandsLayout.maxDrawCount), data->stride);
      break;

    // only available if VK_NV_mesh_shader is enabled
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_NV_EXT:
      VkDrawMeshTasksIndirectCommandNV *data = input;

      vkCmdDrawMeshTasksNV(cmd, data->taskCount, data->firstTask);
      break;

    // only available if VK_NV_mesh_shader is enabled
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_COUNT_NV_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawMeshTasksIndirectNV(cmd, deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), min(data->commandCount, indirectCommandsLayout.maxDrawCount), data->stride);
      break;

    // only available if VK_EXT_mesh_shader is enabled
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_EXT:
      VkDrawMeshTasksIndirectCommandEXT *data = input;

      vkCmdDrawMeshTasksEXT(cmd, data->groupCountX, data->groupCountY, data->groupCountZ);
      break;

    // only available if VK_EXT_mesh_shader is enabled
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_COUNT_EXT:
      VkDrawIndirectCountIndirectCommandEXT* data = input;

      vkCmdDrawMeshTasksIndirectEXT(cmd, deriveBuffer(data->bufferAddress), deriveOffset(data->bufferAddress), min(data->commandCount, indirectCommandsLayout.maxDrawCount), data->stride);
      break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT:
      VkDispatchIndirectCommand *data = input;

      vkCmdDispatch(cmd, data->x, data->y, data->z);
      break;

    // only available if VK_KHR_ray_tracing_maintenance1 is enabled
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_TRACE_RAYS2_EXT:
      vkCmdTraceRaysIndirect2KHR(cmd, deriveBuffer(input));
      break;
    }
  }
}

The processing for VK_NV_device_generated_commands is in principle illustrated below:

void cmdProcessSequence(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, s)
{
  for (t = 0; t < indirectCommandsLayout.tokenCount; t++)
  {
    uint32_t stream  = indirectCommandsLayout.pTokens[t].stream;
    uint32_t offset  = indirectCommandsLayout.pTokens[t].offset;
    uint32_t stride  = indirectCommandsLayout.pStreamStrides[stream];
    stream            = pIndirectCommandsStreams[stream];
    const void* input = stream.buffer.pointer( stream.offset + stride * s + offset );

    // further details later
    indirectCommandsLayout.pTokens[t].command (cmd, pipeline, input, s);
  }
}

void cmdProcessAllSequences(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, sequencesCount)
{
  for (s = 0; s < sequencesCount; s++)
  {
    cmdProcessSequence(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, s);
  }
}

The processing of each sequence is considered stateless, therefore all state changes must occur before any action command tokens within the sequence. A single sequence is strictly targeting the VkPipelineBindPoint it was created with.

The primary input data for each token is provided through VkBuffer content, either at preprocessing time using vkCmdPreprocessGeneratedCommandsNV or at execution time using vkCmdExecuteGeneratedCommandsNV. Some functional arguments, for example binding sets, are instead specified at layout creation time. The input size is different for each token.
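
Because each token in this extension names its own input stream, the location of a token's input depends on the stream's base offset and stride as well as the token's offset. The following is a minimal sketch of that arithmetic; `TokenRef`, `StreamRef`, and `nv_token_input_offset` are hypothetical stand-ins for the layout and stream data, not API types.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the per-token and per-stream data. */
typedef struct { uint32_t stream; uint32_t offset; } TokenRef;
typedef struct { uint64_t bufferOffset; } StreamRef;

/* Byte offset, within the token's stream buffer, of the input that
 * the token reads for sequence `s`. */
static uint64_t nv_token_input_offset(const TokenRef *token,
                                      const StreamRef *streams,
                                      const uint32_t *streamStrides,
                                      uint32_t s)
{
    uint32_t stride = streamStrides[token->stream];
    return streams[token->stream].bufferOffset
         + (uint64_t)stride * s
         + token->offset;
}
```

Tokens that share a stream also share its stride, so their inputs for a given sequence lie within the same `stride`-sized record.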

The following code provides detailed information on how an individual sequence is processed. For valid usage, all restrictions from the regular commands apply.

void cmdProcessSequence(cmd, pipeline, indirectCommandsLayout, pIndirectCommandsStreams, s)
{
  for (uint32_t t = 0; t < indirectCommandsLayout.tokenCount; t++){
    token = indirectCommandsLayout.pTokens[t];

    uint32_t stride   = indirectCommandsLayout.pStreamStrides[token.stream];
    stream            = pIndirectCommandsStreams[token.stream];
    uint32_t offset   = stream.offset + stride * s + token.offset;
    const void* input = stream.buffer.pointer( offset );

    switch(token.tokenType){
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_SHADER_GROUP_NV:
      VkBindShaderGroupIndirectCommandNV* bind = input;

      vkCmdBindPipelineShaderGroupNV(cmd, indirectCommandsLayout.pipelineBindPoint,
        pipeline, bind->groupIndex);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_STATE_FLAGS_NV:
      VkSetStateFlagsIndirectCommandNV* state = input;

      if (token.indirectStateFlags & VK_INDIRECT_STATE_FLAG_FRONTFACE_BIT_NV){
        if (state->data & (1 << 0)){
          set VK_FRONT_FACE_CLOCKWISE;
        } else {
          set VK_FRONT_FACE_COUNTER_CLOCKWISE;
        }
      }
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_NV:
      uint32_t* data = input;

      vkCmdPushConstants(cmd,
        token.pushconstantPipelineLayout,
        token.pushconstantStageFlags,
        token.pushconstantOffset,
        token.pushconstantSize, data);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_INDEX_BUFFER_NV:
      VkBindIndexBufferIndirectCommandNV* data = input;

      // the indexType may optionally be remapped
      // from a custom uint32_t value, via
      // VkIndirectCommandsLayoutTokenNV::pIndexTypeValues

      vkCmdBindIndexBuffer(cmd,
        deriveBuffer(data->bufferAddress),
        deriveOffset(data->bufferAddress),
        data->indexType);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_NV:
      VkBindVertexBufferIndirectCommandNV* data = input;

      // if token.vertexDynamicStride is VK_TRUE
      // then the stride for this binding is set
      // using data->stride as well

      vkCmdBindVertexBuffers(cmd,
        token.vertexBindingUnit, 1,
        &deriveBuffer(data->bufferAddress),
        &deriveOffset(data->bufferAddress));
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_NV:
      vkCmdDrawIndexedIndirect(cmd,
        stream.buffer, offset, 1, 0);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_NV:
      vkCmdDrawIndirect(cmd,
        stream.buffer,
        offset, 1, 0);
    break;

    // only available if VK_NV_mesh_shader is supported
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_TASKS_NV:
      vkCmdDrawMeshTasksIndirectNV(cmd,
        stream.buffer, offset, 1, 0);
    break;

    // only available if VK_EXT_mesh_shader is supported
    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_MESH_TASKS_NV:
      vkCmdDrawMeshTasksIndirectEXT(cmd,
        stream.buffer, offset, 1, 0);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_PIPELINE_NV:
      VkBindPipelineIndirectCommandNV *data = input;
      VkPipeline computePipeline = deriveFromDeviceAddress(data->pipelineAddress);
      vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
    break;

    case VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_NV:
      vkCmdDispatchIndirect(cmd, stream.buffer, offset);
    break;
    }
  }
}

Indirect Commands Generation and Execution

The generation of commands on the device requires a preprocess buffer.

With VK_NV_device_generated_commands, to bind a compute pipeline in Device-Generated Commands, an application must query the pipeline’s device address.

Indirect Execution Sets

It is legal to update an Indirect Execution Set that is in flight as long as the element indices in pExecutionSetWrites are not in use. Any change to an indirect execution set requires recalculating memory requirements by calling vkGetGeneratedCommandsMemoryRequirementsEXT for commands that use that modified state. Commands that are in flight or those not using updated elements require no changes.

The lifetimes of pipelines and shader objects contained in a set must match or exceed the lifetime of the set.

Referencing the functions defined in Indirect Commands Layout, vkCmdExecuteGeneratedCommandsNV behaves as:

uint32_t sequencesCount = sequencesCountBuffer ?
      min(maxSequencesCount, sequencesCountBuffer.load_uint32(sequencesCountOffset)) :
      maxSequencesCount;


cmdProcessAllSequences(commandBuffer, pipeline,
                       indirectCommandsLayout, pIndirectCommandsStreams,
                       sequencesCount,
                       sequencesIndexBuffer, sequencesIndexOffset);

// The stateful commands within indirectCommandsLayout will not
// affect the state of subsequent commands in the target
// command buffer (cmd)

It is important to note that the values of all state related to the pipelineBindPoint used are undefined after this command.

The bound descriptor sets and push constants that will be used with indirect command generation for the compute pipelines must already be specified at the time of preprocessing commands with vkCmdPreprocessGeneratedCommandsNV. They must not change until the execution of indirect commands is submitted with vkCmdExecuteGeneratedCommandsNV.

If push constants for the compute pipeline are also specified in the VkGeneratedCommandsInfoNV::indirectCommandsLayout with VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_NV token, then those values override the push constants that were previously pushed for the compute pipeline.

Referencing the functions defined in Indirect Commands Layout, vkCmdExecuteGeneratedCommandsEXT behaves as:

uint32_t sequencesCount = sequenceCountAddress ?
      min(maxSequenceCount, sequenceCountAddress.load_uint32()) :
      maxSequenceCount;


cmdProcessAllSequences(commandBuffer, indirectExecutionSet,
                       indirectCommandsLayout, indirectAddress,
                       sequencesCount);

// The stateful commands within indirectCommandsLayout will not
// affect the state of subsequent commands in the target
// command buffer (cmd)
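
The sequence count selection in the pseudocode above can be sketched as plain C. This is a host-side model under stated assumptions: `countSource` stands in for the memory behind sequenceCountAddress, with NULL modelling an address of zero, and `effective_sequence_count` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of sequences to process: the stored count clamped
 * to maxSequenceCount, or maxSequenceCount when no count is supplied. */
static uint32_t effective_sequence_count(const uint32_t *countSource,
                                         uint32_t maxSequenceCount)
{
    if (countSource == NULL)
        return maxSequenceCount;
    uint32_t requested = *countSource;
    return requested < maxSequenceCount ? requested : maxSequenceCount;
}
```

The clamp means a count written by the device can reduce the work done but can never exceed the maximum supplied when the command was recorded.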

It is important to note that the affected values of all state related to the shaderStages used are undefined after this command. This means that, for example, if this command indirectly alters push constants, the push constant state becomes undefined.

The bound descriptor sets and push constants that will be used with indirect command generation must already be specified on stateCommandBuffer at the time of preprocessing commands with vkCmdPreprocessGeneratedCommandsEXT. They must match the bound descriptor sets and push constants used in the execution of indirect commands with vkCmdExecuteGeneratedCommandsEXT.

If push constants for shader stages are also specified in the VkGeneratedCommandsInfoEXT::indirectCommandsLayout with a VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT or VK_INDIRECT_COMMANDS_TOKEN_TYPE_SEQUENCE_INDEX_EXT token, then those values override the push constants that were previously pushed.

All state bound on stateCommandBuffer will be used. All state bound on stateCommandBuffer must be identical to the state bound at the time vkCmdExecuteGeneratedCommandsEXT is recorded. The queue family index stateCommandBuffer was allocated from must be the same as the queue family index of the command buffer used in vkCmdExecuteGeneratedCommandsEXT.

On some implementations, preprocessing may have no effect on performance.

vkCmdExecuteGeneratedCommandsEXT may write to the preprocess buffer, regardless of the isPreprocess parameter. In this case, the implementation must insert appropriate synchronization automatically, which corresponds to the following pseudocode:

Barrier
- srcStageMask = DRAW_INDIRECT
- srcAccessMask = 0
- dstStageMask = COMMAND_PREPROCESS_BIT
- dstAccessMask = COMMAND_PREPROCESS_WRITE_BIT | COMMAND_PREPROCESS_READ_BIT
Do internal writes
Barrier
- srcStageMask = COMMAND_PREPROCESS_BIT
- srcAccessMask = COMMAND_PREPROCESS_WRITE_BIT
- dstStageMask = DRAW_INDIRECT
- dstAccessMask = INDIRECT_COMMAND_READ_BIT
Execute