Shaders

A shader specifies programmable operations that execute for each vertex, control point, tessellated vertex, primitive, fragment, or workgroup in the corresponding stage(s) of the graphics and compute pipelines.

Graphics pipelines include vertex shader execution as a result of primitive assembly, followed, if enabled, by tessellation control and evaluation shaders operating on patches, geometry shaders, if enabled, operating on primitives, and fragment shaders, if present, operating on fragments generated by Rasterization. In this specification, vertex, tessellation control, tessellation evaluation and geometry shaders are collectively referred to as pre-rasterization shader stages and occur in the logical pipeline before rasterization. The fragment shader occurs logically after rasterization.

Only the compute shader stage is included in a compute pipeline. Compute shaders operate on compute invocations in a workgroup.

Shaders can read from input variables, and read from and write to output variables. Input and output variables can be used to transfer data between shader stages, or to allow the shader to interact with values that exist in the execution environment. Similarly, the execution environment provides constants describing capabilities.

Shader variables are associated with execution environment-provided inputs and outputs using built-in decorations in the shader. The available decorations for each stage are documented in the following subsections.

Shader Objects

Shaders may be compiled and linked into pipeline objects as described in the Pipelines chapter, or, if the shaderObject feature is enabled, they may be compiled into individual per-stage shader objects which can be bound on a command buffer independently from one another. Unlike pipelines, shader objects are not intrinsically tied to any specific set of state. Instead, state is specified dynamically in the command buffer.

Each shader object represents a single compiled shader stage, which may optionally be linked with one or more other stages.

VkShaderEXT: Opaque handle to a shader object

Shader Object Creation

Shader objects may be created from shader code provided as SPIR-V, or in an opaque, implementation-defined binary format specific to the physical device.

vkCreateShadersEXT: Create one or more new shaders
VkShaderCreateInfoEXT: Structure specifying parameters of a newly created shader
VkShaderCreateFlagsEXT: Bitmask of VkShaderCreateFlagBitsEXT
VkShaderCreateFlagBitsEXT: Bitmask controlling how a shader object is created
The behavior of VK_SHADER_CREATE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_EXT and VK_SHADER_CREATE_FRAGMENT_DENSITY_MAP_ATTACHMENT_BIT_EXT differs subtly from that of VK_PIPELINE_CREATE_RENDERING_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_KHR and VK_PIPELINE_CREATE_RENDERING_FRAGMENT_DENSITY_MAP_ATTACHMENT_BIT_EXT: the shader bit allows, but does not require, the shader to be used with that type of attachment. This means an application need not create multiple shaders when it does not know in advance whether a shader will be used with or without the attachment type, or when it needs the same shader to be usable both with and without it. This flexibility may come at some performance cost on some implementations, so applications should still only set the bits that are actually necessary.
VkShaderCodeTypeEXT: Indicate a shader code type

Binary Shader Code

vkGetShaderBinaryDataEXT: Get the binary shader code from a shader object

Binary Shader Compatibility

Binary shader compatibility means that binary shader code returned from a call to vkGetShaderBinaryDataEXT can be passed to a later call to vkCreateShadersEXT, potentially on a different logical and/or physical device, and that this will result in the successful creation of a shader object functionally equivalent to the shader object that the code was originally queried from.

Binary shader code queried from vkGetShaderBinaryDataEXT is not guaranteed to be compatible across all devices, but implementations are required to provide some compatibility guarantees. Applications may determine binary shader compatibility using either (or both) of two mechanisms.

Guaranteed compatibility of shader binaries is expressed through a combination of the shaderBinaryUUID and shaderBinaryVersion members of the VkPhysicalDeviceShaderObjectPropertiesEXT structure queried from a physical device. Binary shaders retrieved from a physical device with a certain shaderBinaryUUID are guaranteed to be compatible with all other physical devices reporting the same shaderBinaryUUID and the same or higher shaderBinaryVersion.

Whenever a new version of an implementation incorporates any changes that affect the output of vkGetShaderBinaryDataEXT, the implementation should either increment shaderBinaryVersion if binary shader code retrieved from older versions remains compatible with the new implementation, or else replace shaderBinaryUUID with a new value if backward compatibility has been broken. Binary shader code queried from a device with a matching shaderBinaryUUID and lower shaderBinaryVersion relative to the device on which vkCreateShadersEXT is being called may be suboptimal for the new device in ways that do not change shader functionality, but it is still guaranteed to be usable to successfully create the shader object(s).

Implementations are encouraged to share shaderBinaryUUID between devices and driver versions to the maximum extent their hardware naturally allows, and are strongly discouraged from ever changing the shaderBinaryUUID for the same hardware unless absolutely necessary.

In addition to the shader compatibility guarantees described above, it is valid for an application to call vkCreateShadersEXT with binary shader code created on a device with a different or unknown shaderBinaryUUID and/or higher shaderBinaryVersion. In this case, the implementation may use any unspecified means of its choosing to determine whether the provided binary shader code is usable. If it is, vkCreateShadersEXT must return VK_SUCCESS, and the created shader object is guaranteed to be valid. Otherwise, in the absence of some error, vkCreateShadersEXT must return VK_INCOMPATIBLE_SHADER_BINARY_EXT to indicate that the provided binary shader code is not compatible with the device.

Binding Shader Objects

vkCmdBindShadersEXT: Bind shader objects to a command buffer

Setting State

Whenever shader objects are used to issue drawing commands, the appropriate dynamic state setting commands must have been called to set the relevant state in the command buffer prior to drawing:

If a shader is bound to the VK_SHADER_STAGE_VERTEX_BIT stage, the following commands must have been called in the command buffer prior to drawing:

If a shader is bound to the VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT stage, the following command must have been called in the command buffer prior to drawing:

If a shader is bound to the VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT stage, the following command must have been called in the command buffer prior to drawing:

If rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If a shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT stage, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the pipelineFragmentShadingRate feature is enabled, and rasterizerDiscardEnable is VK_FALSE, the following command must have been called in the command buffer prior to drawing:

If the geometryStreams feature is enabled, and a shader is bound to the VK_SHADER_STAGE_GEOMETRY_BIT stage, the following command must have been called in the command buffer prior to drawing:

If the VK_EXT_discard_rectangles extension is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the VK_EXT_conservative_rasterization extension is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the depthClipEnable feature is enabled, the following command must have been called in the command buffer prior to drawing:

If the VK_EXT_sample_locations extension is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the VK_EXT_provoking_vertex extension is enabled, and rasterizerDiscardEnable is VK_FALSE, and a shader is bound to the VK_SHADER_STAGE_VERTEX_BIT stage, the following command must have been called in the command buffer prior to drawing:

If any of the stippledRectangularLines, stippledBresenhamLines, or stippledSmoothLines features are enabled, and rasterizerDiscardEnable is VK_FALSE, and the effective rasterization input topology is in the line topology class, the following commands must have been called in the command buffer prior to drawing:

If the depthClipControl feature is enabled, the following command must have been called in the command buffer prior to drawing:

If the colorWriteEnable feature is enabled, and a shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT stage, and rasterizerDiscardEnable is VK_FALSE, the following command must have been called in the command buffer prior to drawing:

If the attachmentFeedbackLoopDynamicState feature is enabled, and a shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT stage, and rasterizerDiscardEnable is VK_FALSE, the following command must have been called in the command buffer prior to drawing:

If the VK_NV_clip_space_w_scaling extension is enabled, the following commands must have been called in the command buffer prior to drawing:

If the depthClamp and depthClampControl features are enabled, and depthClampEnable is VK_TRUE, the following command must have been called in the command buffer prior to drawing:

If the VK_NV_viewport_swizzle extension is enabled, the following command must have been called in the command buffer prior to drawing:

If the VK_NV_fragment_coverage_to_color extension is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the VK_NV_framebuffer_mixed_samples extension is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the coverageReductionMode feature is enabled, and rasterizerDiscardEnable is VK_FALSE, the following command must have been called in the command buffer prior to drawing:

If the representativeFragmentTest feature is enabled, and rasterizerDiscardEnable is VK_FALSE, the following command must have been called in the command buffer prior to drawing:

If the shadingRateImage feature is enabled, and rasterizerDiscardEnable is VK_FALSE, the following commands must have been called in the command buffer prior to drawing:

If the exclusiveScissor feature is enabled, the following commands must have been called in the command buffer prior to drawing:

State can be set at any time before or after shader objects are bound, but all required state must be set before issuing drawing commands.

If the commandBufferInheritance feature is enabled, graphics and compute state is inherited from the previously executed command buffer in the queue. Any valid state inherited in this way does not need to be set again in the current command buffer.

Interaction With Pipelines

Calling vkCmdBindShadersEXT causes the pipeline bind points corresponding to each stage in pStages to be disturbed, meaning that any pipelines that had previously been bound to those pipeline bind points are no longer bound.

If VK_PIPELINE_BIND_POINT_GRAPHICS is disturbed (i.e., if pStages contains any graphics stage), any graphics pipeline state that the previously bound pipeline did not specify as dynamic becomes undefined, and must be set in the command buffer before issuing drawing commands using shader objects.

Calls to vkCmdBindPipeline likewise disturb the shader stage(s) corresponding to pipelineBindPoint, meaning that any shaders that had previously been bound to any of those stages are no longer bound, even if the pipeline was created without shaders for some of those stages.

Shader Object Destruction

vkDestroyShaderEXT: Destroy a shader object

Shader Modules

VkShaderModule: Opaque handle to a shader module object
vkCreateShaderModule: Creates a new shader module object
VkShaderModuleCreateInfo: Structure specifying parameters of a newly created shader module
VkShaderModuleCreateFlags: Reserved for future use
VkShaderModuleValidationCacheCreateInfoEXT: Specify validation cache to use during shader module creation
vkDestroyShaderModule: Destroy a shader module

Shader Module Identifiers

vkGetShaderModuleIdentifierEXT: Query a unique identifier for a shader module
vkGetShaderModuleCreateInfoIdentifierEXT: Query a unique identifier for a shader module create info
VkShaderModuleIdentifierEXT: A unique identifier for a shader module
VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT: Maximum length of a shader module identifier

Binding Shaders

Before a shader can be used, it must first be bound to the command buffer.

Calling vkCmdBindPipeline binds all stages corresponding to the VkPipelineBindPoint. Calling vkCmdBindShadersEXT binds all stages in pStages.

The following table describes the relationship between shader stages and pipeline bind points:

Shader stage | Pipeline bind point | Behavior controlled
  • VK_SHADER_STAGE_VERTEX_BIT
  • VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT
  • VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT
  • VK_SHADER_STAGE_GEOMETRY_BIT
  • VK_SHADER_STAGE_FRAGMENT_BIT
  • VK_SHADER_STAGE_TASK_BIT_EXT
  • VK_SHADER_STAGE_MESH_BIT_EXT

VK_PIPELINE_BIND_POINT_GRAPHICS

all drawing commands

  • VK_SHADER_STAGE_COMPUTE_BIT

VK_PIPELINE_BIND_POINT_COMPUTE

all dispatch commands

  • VK_SHADER_STAGE_ANY_HIT_BIT_KHR
  • VK_SHADER_STAGE_CALLABLE_BIT_KHR
  • VK_SHADER_STAGE_CLOSEST_HIT_BIT_KHR
  • VK_SHADER_STAGE_INTERSECTION_BIT_KHR
  • VK_SHADER_STAGE_MISS_BIT_KHR
  • VK_SHADER_STAGE_RAYGEN_BIT_KHR

VK_PIPELINE_BIND_POINT_RAY_TRACING_KHR

vkCmdTraceRaysNV, vkCmdTraceRaysKHR, and vkCmdTraceRaysIndirectKHR

  • VK_SHADER_STAGE_SUBPASS_SHADING_BIT_HUAWEI
  • VK_SHADER_STAGE_CLUSTER_CULLING_BIT_HUAWEI

VK_PIPELINE_BIND_POINT_SUBPASS_SHADING_HUAWEI

vkCmdSubpassShadingHUAWEI

  • VK_SHADER_STAGE_COMPUTE_BIT

VK_PIPELINE_BIND_POINT_EXECUTION_GRAPH_AMDX

all execution graph commands

Shader Execution

At each stage of the pipeline, multiple invocations of a shader may execute simultaneously. Further, invocations of a single shader produced as the result of different commands may execute simultaneously. The relative execution order of invocations of the same shader type is undefined. Shader invocations may complete in a different order than that in which the primitives they originated from were drawn or dispatched by the application. However, fragment shader outputs are written to attachments in rasterization order.

The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.

Shader Termination

A shader invocation that is terminated has finished executing instructions.

Executing OpReturn in the entry point, or executing OpTerminateInvocation in any function, will terminate an invocation. Implementations may also terminate a shader invocation when OpKill is executed in any function; otherwise, the invocation becomes a helper invocation.

In addition to the above conditions, helper invocations may be terminated when all non-helper invocations in the same derivative group either terminate or become helper invocations.

A shader stage for a given command completes execution when all invocations for that stage have terminated.

Depending on the implementation, OpKill will be functionally equivalent to either OpTerminateInvocation or OpDemoteToHelperInvocation. To obtain the most predictable behavior, shader authors should use OpTerminateInvocation or OpDemoteToHelperInvocation rather than OpKill wherever possible.

Shader Out-of-Bounds Memory Access

Shader accesses to memory are not automatically bounds checked by the implementation. Applications must not execute operations that would access out of bounds memory locations unless some form of bounds checking is enabled. An access is considered out of bounds if any part of the access is outside of any specified memory range, whether that is the array length specified in a shader or a range specified in the API (e.g. descriptor size).

External tooling such as the Vulkan Validation Layers can be used to help validate that accesses are not out of bounds.

An access can be independently out of bounds for each range that applies; if one is bounds checked and the others are not, behavior is still undefined.

For example, given the following shader declaration
// Buffer type
struct MySSBO {
    uint32_t data[2];
};
accessing data at an index greater than 1 is undefined behavior, whether or not the underlying buffer is larger than that.

Vulkan provides functionality that enables automatic bounds checking in some cases, as outlined below.

Automatic bounds checking can be used to ensure that accesses outside of certain bounds have predictable results, acting as a safety net for untrusted code, or simply as a way for applications to avoid their own bounds checks. While there may be a performance cost for enabling these features, they should not be slower than an application performing equivalent checks. Automatic checks do not necessarily account for all possible bounds - e.g. Robust Buffer Access will not prevent undefined behavior in the buffer access example in the prior note.

Robust Buffer Access

Robust buffer access can be enabled by specifying VK_PIPELINE_ROBUSTNESS_BUFFER_BEHAVIOR_ROBUST_BUFFER_ACCESS in VkPipelineRobustnessCreateInfo, or by specifying VK_PIPELINE_ROBUSTNESS_BUFFER_BEHAVIOR_DEVICE_DEFAULT and enabling the robustBufferAccess feature.

When robust buffer access is enabled, access to a buffer via a descriptor is bounds checked against the range specified for the descriptor, and access to vertex input data is bounds checked against the bound vertex buffer range. Reads from a vertex input may instead be bounds checked against a range rounded down to the nearest multiple of the stride of its binding.

The range of a descriptor is not necessarily equivalent to the size of the underlying resource; applications may suballocate descriptors from larger buffers, for instance. The APIs specifying the descriptor range vary between resource types and descriptor interfaces, but for example include the ranges specified by VkDescriptorBufferInfo or VkBufferViewCreateInfo.

If any vertex input read is outside of the checked range, all other vertex input reads through the same binding in the same shader invocation may behave as if they were outside of the checked range.

If any access to a uniform, storage, uniform texel, or storage texel buffer is outside of the checked range, any access of the same type (write, read-modify-write, or read) to the same buffer that is less than 16 bytes away from the first access may behave as if it is also outside of the checked range.

Any non-atomic access to a uniform, storage, uniform texel, or storage texel buffer wider than 32-bits may be treated as multiple 32-bit accesses that are separately bounds checked.

Writes to a storage or storage texel buffer outside of the checked range will either be discarded, or modify values within the memory range(s) bound to the underlying buffer (including outside of the checked range). They will not modify any other memory.

Non-atomic writes outside of the checked range can lead to data races, as the application has no control over where the data will be written.

Atomic read-modify-write operations to a storage or storage texel buffer outside of the checked range will behave the same as a write outside of the checked range, but will return an undefined value.

Reading a uniform, storage, uniform texel, or storage texel buffer outside of the checked range will return one of the following values:

  • Values from anywhere within the memory range(s) bound to the underlying buffer object, which may include bytes beyond the size of the buffer itself.
  • Zero values
  • For 4-component vectors, a value of (0,0,0,x), where x is any of
    • 0, 1, or the maximum positive integer value for integer components
    • 0.0 or 1.0 for floating-point components
  • The value of the last store to the same out-of-bounds location in the same shader invocation.
    • Using the Volatile/VolatileTexel memory/image operand, the Volatile memory semantic, or the Volatile decoration to load the value will prevent prior stored values from being returned.
Getting the value of the previous store is possible as implementations are free to optimize multiple accesses in the general case. There are several ways this can be prevented, but using volatile loads is by far the simplest.

Reads from a vertex input outside of the checked range will produce one of the following values:

  • Values from anywhere within the memory range(s) bound to the underlying buffer object, which may include bytes beyond the size of the buffer itself, converted via input extraction.
  • Zero values, converted via input extraction.
  • Zero values
  • For 4-component vectors, a value of (0,0,0,x), where x is any of
    • 0, 1, or the maximum positive integer value for integer components
    • 0.0 or 1.0 for floating-point components

Accesses via OpCooperativeMatrixLoadNV and OpCooperativeMatrixStoreNV are only bounds checked in the above manner if the VkPhysicalDeviceCooperativeMatrixFeaturesNV::cooperativeMatrixRobustBufferAccess feature is enabled.

Accesses via OpCooperativeMatrixLoadKHR and OpCooperativeMatrixStoreKHR are only bounds checked in the above manner if the VkPhysicalDeviceCooperativeMatrixFeaturesKHR::cooperativeMatrixRobustBufferAccess feature is enabled.

Accesses using OpCooperativeVector* instructions are not bounds-checked.

Robust Buffer Access 2

Robust buffer access 2 can be enabled by specifying VK_PIPELINE_ROBUSTNESS_BUFFER_BEHAVIOR_ROBUST_BUFFER_ACCESS_2 in VkPipelineRobustnessCreateInfo, or by specifying VK_PIPELINE_ROBUSTNESS_BUFFER_BEHAVIOR_DEVICE_DEFAULT and enabling the robustBufferAccess2 feature.

When robust buffer access 2 is enabled, access to a buffer via a descriptor is bounds checked against the range specified for the descriptor, and access to vertex input data is bounds checked against the bound vertex buffer range, similarly to Robust Buffer Access, but with tighter bounds on the results.

Accesses to a uniform buffer may instead be bounds checked against a range rounded up to robustUniformBufferAccessSizeAlignment. Accesses inside the aligned range may behave as if they are in bounds, even if they are outside of the unaligned descriptor range, and access memory accordingly. The same is true for accesses to a storage buffer, using the robustStorageBufferAccessSizeAlignment limit instead.

To avoid unexpected data races between neighboring descriptor ranges, applications may wish to ensure suballocated ranges of buffers are aligned to these limits.

Any access to a uniform, storage, uniform texel, or storage texel buffer wider than 32-bits may be treated as multiple 32-bit accesses that are separately bounds checked.

Accesses to null descriptors are not considered out-of-bounds and have separate behavior controlled by the nullDescriptor feature.

Writes to a storage or storage texel buffer outside of the checked range will not modify any memory.

Atomic read-modify-write operations to a storage or storage texel buffer outside of the checked range will behave the same as a write outside of the checked range, but will return an undefined value.

Reads from a uniform or storage buffer outside of the checked range will return zero values. If a value was previously written to the same out of bounds location in the same shader invocation, that value may be returned instead; using the Volatile/VolatileTexel memory/image operand, the Volatile memory semantic, or the Volatile decoration to load the value will prevent prior stored values from being returned.

Reading a uniform texel or storage texel buffer outside of the checked range will produce zero values, but component substitution will still be applied based on the buffer view’s format, with the resulting value returned to the shader. If a value was previously written to the same out of bounds location in the same shader invocation, that value may be returned instead; using the Volatile/VolatileTexel memory/image operand, the Volatile memory semantic, or the Volatile decoration to load the value will prevent prior stored values from being returned.

Reads from a vertex input outside of the checked range will produce zero values, but input extraction will still be applied, filling missing G, B, or A components with (0,0,1).

Accesses via OpCooperativeMatrixLoadNV and OpCooperativeMatrixStoreNV are only bounds checked in the above manner if the VkPhysicalDeviceCooperativeMatrixFeaturesNV::cooperativeMatrixRobustBufferAccess feature is enabled.

Accesses via OpCooperativeMatrixLoadKHR and OpCooperativeMatrixStoreKHR are only bounds checked in the above manner if the VkPhysicalDeviceCooperativeMatrixFeaturesKHR::cooperativeMatrixRobustBufferAccess feature is enabled.

Accesses using OpCooperativeVector* instructions are not bounds-checked.

Image Sampling

Sampling operations on an image descriptor are always well-defined when coordinates exceeding the dimensions specified for the descriptor are accessed, as described in the Wrapping Operation section.

Robust Image Access

Robust image access can be enabled by specifying VK_PIPELINE_ROBUSTNESS_IMAGE_BEHAVIOR_ROBUST_IMAGE_ACCESS in VkPipelineRobustnessCreateInfo, or by specifying VK_PIPELINE_ROBUSTNESS_IMAGE_BEHAVIOR_DEVICE_DEFAULT and enabling the robustImageAccess feature.

If robust image access is enabled, accesses to image descriptors are bounds checked against the image view dimensions specified for the descriptor.

Writes or atomic read-modify-write operations to a storage image outside of the checked dimensions will not modify any memory.

Reads, atomic read-modify-write operations, or fetches from images outside of the checked dimensions will return zero values, with (0,0,1) or (0,0,0) values inserted for missing G, B, or A components based on the format.

If a value was previously written to the same out of bounds location in the same shader invocation, that value may be returned instead; using the VolatileTexel image operand, the Volatile memory semantic, or the Volatile decoration to load the value will prevent prior stored values from being returned.

Robust Image Access 2

This is largely identical to Robust Image Access; the only difference being that the alpha channel must be replaced with 1, rather than 1 or 0, for out of bounds texel access.

Robust image access 2 can be enabled by specifying VK_PIPELINE_ROBUSTNESS_IMAGE_BEHAVIOR_ROBUST_IMAGE_ACCESS_2 in VkPipelineRobustnessCreateInfo, or by specifying VK_PIPELINE_ROBUSTNESS_IMAGE_BEHAVIOR_DEVICE_DEFAULT and enabling the robustImageAccess2 feature.

If robust image access 2 is enabled, accesses to image descriptors are bounds checked against the image view dimensions specified for the descriptor.

Writes or atomic read-modify-write operations to a storage image outside of the checked dimensions will not modify any memory.

Reads, atomic read-modify-write operations, or fetches from images outside of the checked dimensions will return zero values, with (0,0,1) values inserted for missing G, B, or A components based on the format.

If a value was previously written to the same out of bounds location in the same shader invocation, that value may be returned instead; using the VolatileTexel image operand, the Volatile memory semantic, or the Volatile decoration to load the value will prevent prior stored values from being returned.

Shader Memory Access Ordering

The order in which image or buffer memory is read or written by shaders is largely undefined. For some shader types (vertex, tessellation evaluation, and in some cases, fragment), even the number of shader invocations that may perform loads and stores is undefined.

In particular, the following rules apply:

  • Vertex and tessellation evaluation shaders will be invoked at least once for each unique vertex, as defined in those sections.
  • Fragment shaders will be invoked zero or more times, as defined in that section.
  • The relative execution order of invocations of the same shader type is undefined. A store issued by a shader when working on primitive B might complete prior to a store for primitive A, even if primitive A is specified prior to primitive B. This applies even to fragment shaders; while fragment shader outputs are always written to the framebuffer in rasterization order, stores executed by fragment shader invocations are not.
  • The relative execution order of invocations of different shader types is largely undefined.
The above limitations on shader invocation order make some forms of synchronization between shader invocations within a single set of primitives unimplementable. For example, having one invocation poll memory written by another invocation assumes that the other invocation has been launched and will complete its writes in finite time.

The Memory Model appendix defines the terminology and rules for how to correctly communicate between shader invocations, such as when a write is Visible-To a read, and what constitutes a Data Race. Applications must not cause a data race.

Shader Inputs and Outputs

Data is passed into and out of shaders using variables with input or output storage class, respectively. User-defined inputs and outputs are connected between stages by matching their Location decorations. Additionally, data can be provided by or communicated to special functions provided by the execution environment using BuiltIn decorations.

In many cases, the same BuiltIn decoration can be used in multiple shader stages with similar meaning. The specific behavior of variables decorated as BuiltIn is documented in the following sections.

Task Shaders

Task shaders operate in conjunction with mesh shaders to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Their primary purpose is to create a variable number of subsequent mesh shader invocations.

Task shaders are invoked via the execution of the programmable mesh shading pipeline.

The task shader has no fixed-function inputs other than variables identifying the specific workgroup and invocation. In the TaskNV Execution Model the number of mesh shader workgroups to create is specified via a TaskCountNV decorated output variable. In the TaskEXT Execution Model the number of mesh shader workgroups to create is specified via the OpEmitMeshTasksEXT instruction.

The task shader can write additional outputs to task memory, which can be read by all of the mesh shader workgroups it created.

Task Shader Execution

Task workloads are formed from groups of work items called workgroups and processed by the task shader in the current graphics pipeline. A workgroup is a collection of shader invocations that execute the same shader, potentially in parallel. Task shaders execute in global workgroups which are divided into a number of local workgroups with a size that can be set by assigning a value to the LocalSize or LocalSizeId execution mode or via an object decorated by the WorkgroupSize decoration. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables and issue memory and control flow barriers to synchronize with other members of the local workgroup. If the subpass includes multiple views in its view mask, a Task shader using TaskEXT Execution Model may be invoked separately for each view.

Mesh Shaders

Mesh shaders operate in workgroups to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Each workgroup emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.

Mesh shaders are invoked via the execution of the programmable mesh shading pipeline.

The only inputs available to the mesh shader are variables identifying the specific workgroup and invocation and, if applicable, any outputs written to task memory by the task shader that spawned the mesh shader’s workgroup. The mesh shader can operate without a task shader as well.

The invocations of the mesh shader workgroup write an output mesh, comprising a set of primitives with per-primitive attributes, a set of vertices with per-vertex attributes, and an array of indices identifying the mesh vertices that belong to each primitive. The primitives of this mesh are then processed by subsequent graphics pipeline stages, where the outputs of the mesh shader form an interface with the fragment shader.

Mesh Shader Execution

Mesh workloads are formed from groups of work items called workgroups and processed by the mesh shader in the current graphics pipeline. A workgroup is a collection of shader invocations that execute the same shader, potentially in parallel. Mesh shaders execute in global workgroups which are divided into a number of local workgroups with a size that can be set by assigning a value to the LocalSize or LocalSizeId execution mode or via an object decorated by the WorkgroupSize decoration. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables and issue memory and control flow barriers to synchronize with other members of the local workgroup.

The global workgroups may be generated explicitly via the API, or implicitly through the task shader’s work creation mechanism. If the subpass includes multiple views in its view mask, a mesh shader using the MeshEXT Execution Model may be invoked separately for each view.

Cluster Culling Shaders

Cluster Culling shaders are invoked via the execution of the Programmable Cluster Culling Shading pipeline.

The only inputs available to the cluster culling shader are variables identifying the specific workgroup and invocation.

Cluster Culling shaders operate in workgroups to perform cluster-based culling and produce zero or more cluster drawing commands that will be processed by subsequent stages of the graphics pipeline.

The cluster drawing command (CDC) is very similar to a multi-draw indirect (MDI) command; invocations in a workgroup can emit zero or more CDCs to draw zero or more visible clusters.

Cluster Culling Shader Execution

Cluster Culling workloads are formed from groups of work items called workgroups and processed by the cluster culling shader in the current graphics pipeline. A workgroup is a collection of shader invocations that execute the same shader, potentially in parallel. Cluster Culling shaders execute in global workgroups which are divided into a number of local workgroups with a size that can be set by assigning a value to the LocalSize or LocalSizeId execution mode or via an object decorated by the WorkgroupSize decoration. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables and issue memory and control flow barriers to synchronize with other members of the local workgroup.

Vertex Shaders

Each vertex shader invocation operates on one vertex and its associated vertex attribute data, and outputs one vertex and associated data. Graphics pipelines using primitive shading must include a vertex shader, and the vertex shader stage is always the first shader stage in the graphics pipeline.

Vertex Shader Execution

A vertex shader must be executed at least once for each vertex specified by a drawing command. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view. During execution, the shader is presented with the index of the vertex and instance for which it has been invoked. Input variables declared in the vertex shader are filled by the implementation with the values of vertex attributes associated with the invocation being executed.

If the same vertex is specified multiple times in a drawing command (e.g. by including the same index value multiple times in an index buffer) the implementation may reuse the results of vertex shading if it can statically determine that the vertex shader invocations will produce identical results.

It is implementation-dependent when and if results of vertex shading are reused, and thus how many times the vertex shader will be executed. This is true also if the vertex shader contains stores or atomic operations (see vertexPipelineStoresAndAtomics).

Tessellation Control Shaders

The tessellation control shader is used to read an input patch provided by the application and to produce an output patch. Each tessellation control shader invocation operates on an input patch (after all control points in the patch are processed by a vertex shader) and its associated data, and outputs a single control point of the output patch and its associated data, and can also output additional per-patch data. The input patch is sized according to the patchControlPoints member of VkPipelineTessellationStateCreateInfo, as part of input assembly.

The input patch can also be dynamically sized with patchControlPoints parameter of vkCmdSetPatchControlPointsEXT.

vkCmdSetPatchControlPointsEXT — Specify the number of control points per patch dynamically for a command buffer

The size of the output patch is controlled by the OpExecutionMode OutputVertices specified in the tessellation control or tessellation evaluation shaders, which must be specified in at least one of the shaders. The size of the input and output patches must each be greater than zero and less than or equal to VkPhysicalDeviceLimits::maxTessellationPatchSize.

Tessellation Control Shader Execution

A tessellation control shader is invoked at least once for each output vertex in a patch. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

Inputs to the tessellation control shader are generated by the vertex shader. Each invocation of the tessellation control shader can read the attributes of any incoming vertices and their associated data. The invocations corresponding to a given patch execute logically in parallel, with undefined relative execution order. However, the OpControlBarrier instruction can be used to provide limited control of the execution order by synchronizing invocations within a patch, effectively dividing tessellation control shader execution into a set of phases. Tessellation control shaders will read undefined values if one invocation reads a per-vertex or per-patch output written by another invocation at any point during the same phase, or if two invocations attempt to write different values to the same per-patch output in a single phase.

Tessellation Evaluation Shaders

The Tessellation Evaluation Shader operates on an input patch of control points and their associated data, and a single input barycentric coordinate indicating the invocation’s relative position within the subdivided patch, and outputs a single vertex and its associated data.

Tessellation Evaluation Shader Execution

A tessellation evaluation shader is invoked at least once for each unique vertex generated by the tessellator. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

Geometry Shaders

The geometry shader operates on a group of vertices and their associated data assembled from a single input primitive, and emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.

Geometry Shader Execution

A geometry shader is invoked at least once for each primitive produced by the tessellation stages, or at least once for each primitive generated by primitive assembly when tessellation is not in use. A shader can request that the geometry shader runs multiple instances. A geometry shader is invoked at least once for each instance. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

Fragment Shaders

Fragment shaders are invoked as a fragment operation in a graphics pipeline. Each fragment shader invocation operates on a single fragment and its associated data. With few exceptions, fragment shaders do not have access to any data associated with other fragments and are considered to execute in isolation of fragment shader invocations associated with other fragments.

Compute Shaders

Compute shaders are invoked via dispatching commands. In general, they have access to similar resources as shader stages executing as part of a graphics pipeline.

Compute workloads are formed from groups of work items called workgroups and processed by the compute shader in the current compute pipeline. A workgroup is a collection of shader invocations that execute the same shader, potentially in parallel. Compute shaders execute in global workgroups which are divided into a number of local workgroups with a size that can be set by assigning a value to the LocalSize or LocalSizeId execution mode or via an object decorated by the WorkgroupSize decoration. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables and issue memory and control flow barriers to synchronize with other members of the local workgroup.
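The relationship between the global workgroup, local workgroups, and individual invocations can be sketched as follows. This is an illustrative Python model, not part of the specification; the function name is hypothetical, but the arithmetic mirrors GLSL's gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.

```python
def global_invocation_id(workgroup_id, local_id, local_size):
    # All three arguments are (x, y, z) tuples: the local workgroup's ID
    # within the global workgroup, the invocation's ID within the local
    # workgroup, and the local workgroup size.
    return tuple(wg * size + loc
                 for wg, loc, size in zip(workgroup_id, local_id, local_size))

# Workgroup (1, 2, 0) with local size (8, 8, 1), local invocation (3, 4, 0):
print(global_invocation_id((1, 2, 0), (3, 4, 0), (8, 8, 1)))  # (11, 20, 0)
```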

Ray Generation Shaders

A ray generation shader is similar to a compute shader. Its main purpose is to execute ray tracing queries using pipeline trace ray instructions (such as OpTraceRayKHR) and process the results.

Ray Generation Shader Execution

One ray generation shader is executed per ray tracing dispatch. Its location in the shader binding table (see Shader Binding Table for details) is passed directly into vkCmdTraceRaysKHR using the pRaygenShaderBindingTable parameter or vkCmdTraceRaysNV using the raygenShaderBindingTableBuffer and raygenShaderBindingOffset parameters.

Intersection Shaders

Intersection shaders enable the implementation of arbitrary, application defined geometric primitives. An intersection shader for a primitive is executed whenever its axis-aligned bounding box is hit by a ray.

Like other ray tracing shader domains, an intersection shader operates on a single ray at a time. It also operates on a single primitive at a time. It is therefore the purpose of an intersection shader to compute the ray-primitive intersections and report them. To report an intersection, the shader calls the OpReportIntersectionKHR instruction.

An intersection shader communicates with any-hit and closest hit shaders by generating attribute values that they can read. Intersection shaders cannot read or modify the ray payload.

Intersection Shader Execution

The order in which intersections are found along a ray, and therefore the order in which intersection shaders are executed, is unspecified.

The intersection shader of the closest AABB which intersects the ray is guaranteed to be executed at some point during traversal, unless the ray is forcibly terminated.

Any-Hit Shaders

The any-hit shader is executed after the intersection shader reports an intersection that lies within the current [tmin,tmax] of the ray. The main use of any-hit shaders is to programmatically decide whether or not an intersection will be accepted. The intersection will be accepted unless the shader calls the OpIgnoreIntersectionKHR instruction. Any-hit shaders have read-only access to the attributes generated by the corresponding intersection shader, and can read or modify the ray payload.

Any-Hit Shader Execution

The order in which intersections are found along a ray, and therefore the order in which any-hit shaders are executed, is unspecified.

The any-hit shader of the closest hit is guaranteed to be executed at some point during traversal, unless the ray is forcibly terminated.

Closest Hit Shaders

Closest hit shaders have read-only access to the attributes generated by the corresponding intersection shader, and can read or modify the ray payload. They also have access to a number of system-generated values. Closest hit shaders can call pipeline trace ray instructions to recursively trace rays.

Closest Hit Shader Execution

Exactly one closest hit shader is executed when traversal is finished and an intersection has been found and accepted.

Miss Shaders

Miss shaders can access the ray payload and can trace new rays through the pipeline trace ray instructions, but cannot access attributes since they are not associated with an intersection.

Miss Shader Execution

A miss shader is executed instead of a closest hit shader if no intersection was found during traversal.

Callable Shaders

Callable shaders can access a callable payload that works similarly to ray payloads to do subroutine work.

Callable Shader Execution

A callable shader is executed by calling OpExecuteCallableKHR from an allowed shader stage.

Interpolation Decorations

Variables in the Input storage class in a fragment shader’s interface are interpolated from the values specified by the primitive being rasterized.

Interpolation decorations can be present on input and output variables in pre-rasterization shaders but have no effect on the interpolation performed.

An undecorated input variable will be interpolated with perspective-correct interpolation according to the primitive type being rasterized. Lines and polygons are interpolated in the same way as the primitive’s clip coordinates. If the NoPerspective decoration is present, linear interpolation is instead used for lines and polygons. For points, as there is only a single vertex, input values are never interpolated and instead take the value written for the single vertex.

If the Flat decoration is present on an input variable, the value is not interpolated, and instead takes its value directly from the provoking vertex. Fragment shader inputs that are signed or unsigned integers, integer vectors, or any double-precision floating-point type must be decorated with Flat.

Interpolation of input variables is performed at an implementation-defined position within the fragment area being shaded. The position is further constrained as follows:

  • If the Centroid decoration is used, the interpolation position used for the variable must also fall within the bounds of the primitive being rasterized.
  • If the Sample decoration is used, the interpolation position used for the variable must be at the position of the sample being shaded by the current fragment shader invocation.
  • If a sample count of 1 is used, the interpolation position must be at the center of the fragment area.
As Centroid constrains the interpolation position to lie within the covered area of the primitive, using it may cause the position to differ between neighboring fragments when it otherwise would not. Derivatives calculated based on these differing locations can produce inconsistent results compared to undecorated inputs. Thus using Centroid with input variables used in derivative calculations is not recommended.

If the PerVertexKHR decoration is present on an input variable, the value is not interpolated, and instead values from all input vertices are available in an array. Each index of the array corresponds to one of the vertices of the primitive that produced the fragment.

If the CustomInterpAMD decoration is present on an input variable, the value cannot be accessed directly; instead the extended instruction InterpolateAtVertexAMD must be used to obtain values from the input vertices.

Push Constant Decorations

Variables in the PushConstant storage class can be decorated with additional parameters to control their placement and behavior within push constant banks.

The BankNV decoration specifies which hardware push constant bank a variable or block should be placed in. When present on a push constant variable or block, it indicates the hardware bank index to use for accessing the push constant data. When BankNV is absent, it behaves as if the value is 0.

The MemberOffsetNV decoration specifies an additional offset within a push constant bank for push constant variables or blocks. This decoration allows control over the placement of push constants within the specified bank, enabling more efficient memory layout and access patterns. When MemberOffsetNV is absent, it behaves as if the value is 0.

Static Use

A SPIR-V module declares a global object in memory using the OpVariable or OpUntypedVariableKHR instruction, which results in a pointer x to that object. A specific entry point in a SPIR-V module is said to statically use that object if that entry point’s call tree contains a function containing an instruction with x as an id operand. A shader entry point also statically uses any variables explicitly declared in its interface.

Scope

A scope describes a set of shader invocations, where each such set is a scope instance. Each invocation belongs to one or more scope instances, but belongs to no more than one scope instance for each scope.

The operations available between invocations in a given scope instance vary, with smaller scopes generally able to perform more operations, and with greater efficiency.

Cross Device

All invocations executed in a Vulkan instance fall into a single cross device scope instance.

Whilst the CrossDevice scope is defined in SPIR-V, it is disallowed in Vulkan. API synchronization commands can be used to communicate between devices.

Device

All invocations executed on a single device form a device scope instance.

If the vulkanMemoryModel and vulkanMemoryModelDeviceScope features are enabled, this scope is represented in SPIR-V by the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.

If both the shaderDeviceClock and vulkanMemoryModelDeviceScope features are enabled, using the Device Scope with the OpReadClockKHR instruction will read from a clock that is consistent across invocations in the same device scope instance.

There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.

Invocations executing on different devices in a device group operate in separate device scope instances.

Queue Family

Invocations executed by queues in a given queue family form a queue family scope instance.

This scope is identified in SPIR-V as the QueueFamily Scope if the vulkanMemoryModel feature is enabled, or if not, the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.

If the shaderDeviceClock feature is enabled, but the vulkanMemoryModelDeviceScope feature is not enabled, using the Device Scope with the OpReadClockKHR instruction will read from a clock that is consistent across invocations in the same queue family scope instance.

There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.

Each invocation in a queue family scope instance must be in the same device scope instance.

Command

Any shader invocations executed as the result of a single command such as vkCmdDispatch or vkCmdDraw form a command scope instance. For indirect drawing commands with drawCount greater than one, invocations from separate draws are in separate command scope instances. For ray tracing shaders, an invocation group is an implementation-dependent subset of the set of shader invocations of a given shader stage which are produced by a single trace rays command.

There is no specific Scope for communication across invocations in a command scope instance. As this has a clear boundary at the API level, coordination here can be performed in the API, rather than in SPIR-V.

Each invocation in a command scope instance must be in the same queue family scope instance.

For shaders without defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.

Primitive

Any fragment shader invocations executed as the result of rasterization of a single primitive form a primitive scope instance.

There is no specific Scope for communication across invocations in a primitive scope instance.

Any generated helper invocations are included in this scope instance.

Each invocation in a primitive scope instance must be in the same command scope instance.

Any input variables decorated with Flat are uniform within a primitive scope instance.

Shader Call

Any shader-call-related invocations that are executed in one or more ray tracing execution models form a shader call scope instance.

The ShaderCallKHR Scope can be used as Memory Scope for barrier and atomic operations.

Each invocation in a shader call scope instance must be in the same queue family scope instance.

Workgroup

A local workgroup is a set of invocations that can synchronize and share data with each other using memory in the Workgroup storage class.

The Workgroup Scope can be used as both an Execution Scope and Memory Scope for barrier and atomic operations.

Each invocation in a local workgroup must be in the same command scope instance.

Only task, mesh, and compute shaders have defined workgroups - other shader types cannot use workgroup functionality. For shaders that have defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.

When variables declared with the Workgroup storage class are explicitly laid out (hence they are also decorated with Block), the amount of storage consumed is the size of the largest Block variable, not counting any padding at the end. The amount of storage consumed by the non-Block variables declared with the Workgroup storage class is implementation-dependent. However, the amount of storage consumed may not exceed the largest block size that would be obtained if all active non-Block variables declared with Workgroup storage class were assigned offsets in an arbitrary order by successively taking the smallest valid offset according to the Standard Storage Buffer Layout rules, and with Boolean values considered as 32-bit integer values for the purpose of this calculation. (This is equivalent to using the GLSL std430 layout rules.)
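The offset-assignment rule above can be modeled informally. The following Python sketch is not normative: it handles only scalar-style base alignments (full std430 rules also cover vectors, arrays, and structures), and it computes the storage consumed for one given declaration order by successively taking the smallest valid offset, which is the aligned offset immediately after the previous variable.

```python
def align_up(offset, alignment):
    # Round offset up to the next multiple of alignment.
    return (offset + alignment - 1) // alignment * alignment

def packed_size(variables):
    # variables: list of (size_in_bytes, base_alignment) pairs, in the
    # order they are assigned offsets.
    offset = 0
    for size, alignment in variables:
        offset = align_up(offset, alignment)  # smallest valid offset
        offset += size
    return offset

# A 32-bit uint (Booleans count as 32-bit for this calculation),
# a vec4 (16-byte base alignment), and a float:
print(packed_size([(4, 4), (16, 16), (4, 4)]))  # 36
```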

Subgroup

A subgroup (see the subsection Control Flow of section 2 of the SPIR-V 1.3 Revision 1 specification) is a set of invocations that can synchronize and share data with each other efficiently.

The Subgroup Scope can be used as both an Execution Scope and Memory Scope for barrier and atomic operations. Other subgroup features allow the use of group operations with subgroup scope.

If the shaderSubgroupClock feature is enabled, using the Subgroup Scope with the OpReadClockKHR instruction will read from a clock that is consistent across invocations in the same subgroup.

For shaders that have defined workgroups, each invocation in a subgroup must be in the same local workgroup.

In other shader stages, each invocation in a subgroup must be in the same device scope instance.

Only shader stages that support subgroup operations have defined subgroups.

Subgroups are not guaranteed to be a subset of a single command in shaders that do not have defined workgroups. Values that are guaranteed to be uniform for a given command or sub command may then not be uniform for the subgroup, and vice versa. As such, applications must take care when dealing with mixed uniformity. A somewhat common example would be something like trying to optimize access to per-draw data using subgroup operations:
buffer DrawData { uint draw_data[]; };

flat in int vDrawID; // Passed through from vertex shader

void main()
{
    // Broadcast the first invocation's load to the rest of the subgroup
    uint local_draw_data = subgroupBroadcastFirst(draw_data[vDrawID]);
}
This can be done in an attempt to optimize the shader to only perform the loads once per subgroup. However, if the implementation packs multiple draws into a single subgroup, invocations from draws with a different vDrawID are now receiving data from the wrong invocation. Applications should rely on implementations to do this kind of optimization automatically where the implementation can, rather than trying to force it.

Quad

A quad scope instance is formed of four shader invocations.

In a fragment shader, each invocation in a quad scope instance is formed of invocations in neighboring framebuffer locations (xi, yi), where:

  • i is the index of the invocation within the scope instance.
  • w and h are the number of pixels the fragment covers in the x and y axes.
  • w and h are identical for all participating invocations.
  • (x0) = (x1 - w) = (x2) = (x3 - w)
  • (y0) = (y1) = (y2 - h) = (y3 - h)
  • Each invocation has the same layer and sample indices.

In a mesh, task, or compute shader, if the DerivativeGroupQuadsKHR execution mode is specified, each invocation in a quad scope instance is formed of invocations with adjacent local invocation IDs (xi, yi), where:

  • i is the index of the invocation within the quad scope instance.
  • (x0) = (x1 - 1) = (x2) = (x3 - 1)
  • (y0) = (y1) = (y2 - 1) = (y3 - 1)
  • x0 and y0 are integer multiples of 2.
  • Each invocation has the same z coordinate.

In a mesh, task, or compute shader, if the DerivativeGroupLinearKHR execution mode is specified, each invocation in a quad scope instance is formed of invocations with adjacent local invocation indices (li), where:

  • i is the index of the invocation within the quad scope instance.
  • (l0) = (l1 - 1) = (l2 - 2) = (l3 - 3)
  • l0 is an integer multiple of 4.
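The two local invocation ID mappings above can be sketched as follows. This is an illustrative Python model with hypothetical function names, not a normative definition; it just evaluates the alignment constraints listed above.

```python
def quad_members_quads_mode(x, y):
    # DerivativeGroupQuadsKHR: the quad containing local invocation ID (x, y).
    # x0 and y0 are integer multiples of 2.
    x0, y0 = x - (x % 2), y - (y % 2)
    return [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]

def quad_members_linear_mode(l):
    # DerivativeGroupLinearKHR: the quad containing local invocation index l.
    # l0 is an integer multiple of 4.
    l0 = l - (l % 4)
    return [l0, l0 + 1, l0 + 2, l0 + 3]

print(quad_members_quads_mode(5, 2))  # [(4, 2), (5, 2), (4, 3), (5, 3)]
print(quad_members_linear_mode(9))    # [8, 9, 10, 11]
```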

In all shaders, each invocation in a quad scope instance is formed of invocations in adjacent subgroup invocation indices (si), where:

  • i is the index of the invocation within the quad scope instance.
  • (s0) = (s1 - 1) = (s2 - 2) = (s3 - 3)
  • s0 is an integer multiple of 4.

Each invocation in a quad scope instance must be in the same subgroup.

In a fragment shader, each invocation in a quad scope instance must be in the same primitive scope instance.

Fragment, mesh, task, and compute shaders have defined quad scope instances. If the quadOperationsInAllStages limit is supported, any shader stages that support subgroup operations also have defined quad scope instances.

Fragment Interlock

A fragment interlock scope instance is formed of fragment shader invocations based on their framebuffer locations (x,y,layer,sample), executed by commands inside a single subpass.

The specific set of invocations included varies based on the execution mode as follows:

  • If the SampleInterlockOrderedEXT or SampleInterlockUnorderedEXT execution modes are used, only invocations with identical framebuffer locations (x,y,layer,sample) are included.
  • If the PixelInterlockOrderedEXT or PixelInterlockUnorderedEXT execution modes are used, fragments with different sample ids are also included.
  • If the ShadingRateInterlockOrderedEXT or ShadingRateInterlockUnorderedEXT execution modes are used, fragments from neighboring framebuffer locations are also included. The shading rate image or fragment shading rate determines these fragments.

Only fragment shaders with one of the above execution modes have defined fragment interlock scope instances.

There is no specific Scope value for communication across invocations in a fragment interlock scope instance. However, this is implicitly used as a memory scope by OpBeginInvocationInterlockEXT and OpEndInvocationInterlockEXT.

Each invocation in a fragment interlock scope instance must be in the same queue family scope instance.

Invocation

The smallest scope is a single invocation; this is represented by the Invocation Scope in SPIR-V.

Fragment shader invocations must be in a primitive scope instance.

Invocations in fragment shaders that have a defined fragment interlock scope must be in a fragment interlock scope instance.

Invocations in shaders that have defined workgroups must be in a local workgroup.

Invocations in shaders that have a defined subgroup scope must be in a subgroup.

Invocations in shaders that have a defined quad scope must be in a quad scope instance.

All invocations in all stages must be in a command scope instance.

Group Operations

Group operations are executed by multiple invocations within a scope instance; with each invocation involved in calculating the result. This provides a mechanism for efficient communication between invocations in a particular scope instance.

Group operations all take a Scope defining the desired scope instance to operate within. Only the Subgroup scope can be used for these operations; the subgroupSupportedOperations limit defines which types of operation can be used.

Basic Group Operations

Basic group operations include the use of OpGroupNonUniformElect, OpControlBarrier, OpMemoryBarrier, and atomic operations.

OpGroupNonUniformElect can be used to choose a single invocation to perform a task for the whole group. Only the invocation with the lowest id in the group will return true.

The Memory Model appendix defines the operation of barriers and atomics.

Vote Group Operations

The vote group operations allow invocations within a group to compare values across a group. The types of votes enabled are:

  • Do all active group invocations agree that an expression is true?
  • Do any active group invocations evaluate an expression to true?
  • Do all active group invocations have the same value of an expression?
These operations are useful in combination with control flow in that they allow for developers to check whether conditions match across the group and choose potentially faster code-paths in these cases.
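The three kinds of vote can be modeled over the values held by the active invocations of a group. This Python sketch uses hypothetical names loosely mirroring subgroupAll, subgroupAny, and subgroupAllEqual; it illustrates the semantics only, not how an implementation evaluates them.

```python
def vote_all(values):
    # Do all active group invocations agree that an expression is true?
    return all(values)

def vote_any(values):
    # Does any active group invocation evaluate an expression to true?
    return any(values)

def vote_all_equal(values):
    # Do all active group invocations have the same value of an expression?
    return len(set(values)) <= 1

# If every invocation shades the same material, a single faster code path
# could be taken for the whole group:
material_ids = [7, 7, 7, 7]
print(vote_all_equal(material_ids))  # True
```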

Arithmetic Group Operations

The arithmetic group operations allow invocations to perform scans and reductions across a group. The operators supported are add, mul, min, max, and, or, xor.

For reductions, every invocation in a group will obtain the cumulative result of these operators applied to all values in the group. For exclusive scans, each invocation in a group will obtain the cumulative result of these operators applied to all values in invocations with a lower index in the group. Inclusive scans are identical to exclusive scans, except the cumulative result includes the operator applied to the value in the current invocation.

The order in which these operators are applied is implementation-dependent.
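The relationships between reductions, exclusive scans, and inclusive scans can be sketched with addition as the operator. This Python model is illustrative only; addition is associative, so this particular operator is insensitive to the implementation-dependent application order noted above.

```python
def group_reduce(values):
    # Every invocation obtains the operator applied to all values.
    return sum(values)

def group_exclusive_scan(values):
    # Each invocation obtains the operator applied to all values in
    # invocations with a lower index; its own value is excluded.
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

def group_inclusive_scan(values):
    # Inclusive scan = exclusive scan plus the current invocation's value.
    return [e + v for e, v in zip(group_exclusive_scan(values), values)]

values = [3, 1, 4, 1]
print(group_reduce(values))          # 9
print(group_exclusive_scan(values))  # [0, 3, 4, 8]
print(group_inclusive_scan(values))  # [3, 4, 8, 9]
```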

Ballot Group Operations

The ballot group operations allow invocations to perform more complex votes across the group. The ballot functionality allows all invocations within a group to provide a boolean value and get as a result what each invocation provided as their boolean value. The broadcast functionality allows values to be broadcast from an invocation to all other invocations within the group.
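The ballot and broadcast functionality can be modeled as follows. This Python sketch is illustrative and the function names are hypothetical: a ballot packs each invocation's boolean into a bitmask visible to every invocation, and a broadcast distributes one invocation's value to all.

```python
def ballot(booleans):
    # Bit i of the result holds invocation i's boolean value.
    mask = 0
    for i, b in enumerate(booleans):
        if b:
            mask |= 1 << i
    return mask

def broadcast(values, index):
    # Every invocation receives the value provided by invocation `index`.
    return [values[index]] * len(values)

print(bin(ballot([True, False, True, True])))  # 0b1101
print(broadcast([10, 20, 30, 40], 2))          # [30, 30, 30, 30]
```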

Shuffle Group Operations

The shuffle group operations allow invocations to read values from other invocations within a group.

Shuffle Relative Group Operations

The shuffle relative group operations allow invocations to read values from other invocations within the group relative to the current invocation in the group. The relative operations supported allow data to be shifted up and down through the invocations within a group.

Clustered Group Operations

The clustered group operations allow invocations to perform an operation among partitions of a group, such that the operation is only performed within the group invocations within a partition. The partitions for clustered group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time. The operations supported are add, mul, min, max, and, or, xor.
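A clustered reduction can be sketched as follows: the group is split into consecutive power-of-two clusters and the operator is applied within each cluster independently, with every member of a cluster obtaining that cluster's result. This Python model is illustrative only; the function name is hypothetical.

```python
def clustered_reduce(values, cluster_size, op=sum):
    # cluster_size must be a power of two and known up front, mirroring
    # the pipeline-creation-time requirement described above.
    assert cluster_size & (cluster_size - 1) == 0, "must be a power of two"
    out = []
    for start in range(0, len(values), cluster_size):
        cluster = values[start:start + cluster_size]
        out.extend([op(cluster)] * len(cluster))  # every member gets the result
    return out

print(clustered_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [10, 10, 10, 10, 26, 26, 26, 26]
```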

Rotate Group Operations

The rotate group operations allow invocations to read values from other invocations within the group relative to the current invocation and modulo the size of the group. Clustered rotate group operations perform the same operation within individual partitions of a group.

The partitions for clustered rotate group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time.
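The rotate semantics can be sketched host-side (illustration only, not Vulkan API): invocation i reads the value held by invocation (i + delta) modulo the group size, and the clustered variant applies the same rotation independently within each partition:

```c
#include <stddef.h>

/* Rotate: invocation i reads the value from invocation
 * (i + delta) mod group size. */
void group_rotate(const int *values, int *out, size_t n, size_t delta) {
    for (size_t i = 0; i < n; i++)
        out[i] = values[(i + delta) % n];
}

/* Clustered rotate: the same rotation, performed independently within
 * each consecutive power-of-two sized partition. */
void group_clustered_rotate(const int *values, int *out, size_t n,
                            size_t cluster_size, size_t delta) {
    for (size_t base = 0; base < n; base += cluster_size)
        for (size_t i = 0; i < cluster_size && base + i < n; i++)
            out[base + i] = values[base + (i + delta) % cluster_size];
}
```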

Quad Group Operations

Quad group operations (OpGroupNonUniformQuad*) are a specialized type of group operation that only operates on quad scope instances. While these instructions do include a Scope parameter, this scope is always overridden; only the quad scope instance is included in its execution scope.

Fragment shaders that statically execute either OpGroupNonUniformQuadBroadcast or OpGroupNonUniformQuadSwap must launch sufficient invocations to ensure their correct operation; additional helper invocations are launched for framebuffer locations not covered by rasterized fragments if necessary.

The index used to select participating invocations is i, as described for a quad scope instance, defined as the quad index in the SPIR-V specification.

For OpGroupNonUniformQuadBroadcast this value is equal to Index. For OpGroupNonUniformQuadSwap, it is equal to the implicit Index used by each participating invocation.

Derivative Operations

Derivative operations calculate the partial derivative for an expression P as a function of an invocation’s x and y coordinates.

Derivative operations operate on a set of invocations known as a derivative group as defined in the SPIR-V specification.

A derivative group in a fragment shader is equivalent to the quad scope instance if the QuadDerivativesKHR execution mode is specified, otherwise it is equivalent to the primitive scope instance. A derivative group in a mesh, task, or compute shader is equivalent to the quad scope instance.

Derivatives are calculated assuming that P is piecewise linear and continuous within the derivative group.

The following control-flow restrictions apply to derivative operations:

  • If the QuadDerivativesKHR execution mode is specified, dynamic instances of any derivative operations must be executed in control flow that is uniform within the current quad scope instance.
  • If the QuadDerivativesKHR execution mode is not specified:
    • dynamic instances of explicit derivative instructions (OpDPdx*, OpDPdy*, and OpFwidth*) must be executed in control flow that is uniform within a derivative group.
    • dynamic instances of implicit derivative operations can be executed in control flow that is not uniform within the derivative group, but results are undefined.

Fragment shaders that statically execute derivative operations must launch sufficient invocations to ensure their correct operation; additional helper invocations are launched for framebuffer locations not covered by rasterized fragments if necessary.

In a mesh, task, or compute shader, it is the application’s responsibility to ensure that sufficient invocations are launched.

Derivative operations calculate their results as the difference between the result of P across invocations in the quad. For fine derivative operations (OpDPdxFine and OpDPdyFine), the values of DPdx(Pi) are calculated as

  • DPdx(P0) = DPdx(P1) = P1 - P0
  • DPdx(P2) = DPdx(P3) = P3 - P2

and the values of DPdy(Pi) are calculated as

  • DPdy(P0) = DPdy(P2) = P2 - P0
  • DPdy(P1) = DPdy(P3) = P3 - P1

where i is the index of each invocation as described in Quad.
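The fine derivative formulas above can be written out directly as a host-side sketch (illustration only), with `p[i]` holding P evaluated in the invocation with quad index i (0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right):

```c
/* Fine x derivatives for a quad, per the formulas:
 * DPdx(P0) = DPdx(P1) = P1 - P0, DPdx(P2) = DPdx(P3) = P3 - P2. */
void quad_dpdx_fine(const float p[4], float dpdx[4]) {
    dpdx[0] = dpdx[1] = p[1] - p[0];
    dpdx[2] = dpdx[3] = p[3] - p[2];
}

/* Fine y derivatives for a quad, per the formulas:
 * DPdy(P0) = DPdy(P2) = P2 - P0, DPdy(P1) = DPdy(P3) = P3 - P1. */
void quad_dpdy_fine(const float p[4], float dpdy[4]) {
    dpdy[0] = dpdy[2] = p[2] - p[0];
    dpdy[1] = dpdy[3] = p[3] - p[1];
}
```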

Coarse derivative operations (OpDPdxCoarse and OpDPdyCoarse) calculate their results in roughly the same manner, but may only calculate two values instead of four (one each for DPdx and DPdy), reusing the same result regardless of the originating invocation. If an implementation does this, it should use the fine derivative calculations described for P0.

Derivative values are calculated between fragments rather than pixels. If the fragment shader invocations involved in the calculation cover multiple pixels, these operations cover a wider area, resulting in larger derivative values. This in turn will result in a coarser LOD being selected for image sampling operations using derivatives. Applications may want to account for this when using multi-pixel fragments; if pixel derivatives are desired, applications should use explicit derivative operations and divide the results by the size of the fragment in each dimension as follows:

  • DPdx(Pn)' = DPdx(Pn) / w
  • DPdy(Pn)' = DPdy(Pn) / h

where w and h are the size of the fragments in the quad, and DPdx(Pn)' and DPdy(Pn)' are the pixel derivatives.

The results for OpDPdx and OpDPdy may be calculated as either fine or coarse derivatives, with implementations favoring the most efficient approach. Implementations must choose coarse or fine consistently between the two.

Executing OpFwidthFine, OpFwidthCoarse, or OpFwidth is equivalent to executing the corresponding OpDPdx* and OpDPdy* instructions, taking the absolute value of the results, and summing them.
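This equivalence amounts to a single line, sketched here host-side for illustration:

```c
#include <math.h>

/* OpFwidth* equivalence: take the absolute values of the corresponding
 * x and y derivatives and sum them. */
float fwidth_from_derivatives(float dpdx, float dpdy) {
    return fabsf(dpdx) + fabsf(dpdy);
}
```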

Executing an OpImage*Sample*ImplicitLod instruction is equivalent to executing OpDPdx(Coordinate) and OpDPdy(Coordinate), and passing the results as the Grad operands dx and dy.

It is expected that using the ImplicitLod variants of sampling functions will be substantially more efficient than using the ExplicitLod variants with explicitly generated derivatives.

Helper Invocations

When performing derivative or quad group operations in a fragment shader, additional invocations may be spawned in order to ensure correct results. These additional invocations are known as helper invocations and can be identified by a non-zero value in the HelperInvocation built-in. Stores and atomics performed by helper invocations must not have any effect on memory except for the Function, Private and Output storage classes, and values returned by atomic instructions in helper invocations are undefined.

While stores to the Output storage class have an effect even in helper invocations, this does not mean that helper invocations affect the framebuffer. Output variables in fragment shaders can be read from as well, and for the duration of the shader invocation they behave more like Private variables.

If the MaximallyReconvergesKHR execution mode is applied to the entry point, helper invocations must remain active for all instructions for the lifetime of the quad scope instance they are a part of. If the MaximallyReconvergesKHR execution mode is not applied to the entry point, helper invocations may be considered inactive for group operations other than derivative and quad group operations. All invocations in a quad scope instance may become permanently inactive at any point once the only remaining invocations in that quad scope instance are helper invocations.

Cooperative Matrices

A cooperative matrix type is a SPIR-V type where the storage for and computations performed on the matrix are spread across the invocations in a scope instance. These types give the implementation freedom in how to optimize matrix multiplies.

SPIR-V defines the types and instructions, but does not specify rules about what sizes/combinations are valid, and it is expected that different implementations may support different sizes.

vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR - Returns properties describing what cooperative matrix types are supported
vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV - Returns properties describing what cooperative matrix types are supported
vkGetPhysicalDeviceCooperativeMatrixPropertiesNV - Returns properties describing what cooperative matrix types are supported

Each VkCooperativeMatrixPropertiesKHR or VkCooperativeMatrixPropertiesNV structure describes a single supported combination of types for a matrix multiply/add operation ( OpCooperativeMatrixMulAddKHR or OpCooperativeMatrixMulAddNV ). The multiply can be described in terms of the following variables and types (in SPIR-V pseudocode):

    %A is of type OpTypeCooperativeMatrixKHR %AType %scope %MSize %KSize %MatrixAKHR
    %B is of type OpTypeCooperativeMatrixKHR %BType %scope %KSize %NSize %MatrixBKHR
    %C is of type OpTypeCooperativeMatrixKHR %CType %scope %MSize %NSize %MatrixAccumulatorKHR
    %Result is of type OpTypeCooperativeMatrixKHR %ResultType %scope %MSize %NSize %MatrixAccumulatorKHR

    %Result = %A * %B + %C // using OpCooperativeMatrixMulAddKHR

or, for the NV variant:

    %A is of type OpTypeCooperativeMatrixNV %AType %scope %MSize %KSize
    %B is of type OpTypeCooperativeMatrixNV %BType %scope %KSize %NSize
    %C is of type OpTypeCooperativeMatrixNV %CType %scope %MSize %NSize
    %D is of type OpTypeCooperativeMatrixNV %DType %scope %MSize %NSize

    %D = %A * %B + %C // using OpCooperativeMatrixMulAddNV

A matrix multiply with these dimensions is known as an MxNxK matrix multiply.
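The arithmetic performed by an MxNxK multiply/add can be sketched host-side (illustration only; how storage and computation are spread across invocations is left entirely to the implementation):

```c
#include <stddef.h>

/* Result = A * B + C, where A is M x K, B is K x N, and C and Result are
 * M x N. Matrices are stored row-major for this sketch. */
void matrix_mul_add(size_t M, size_t N, size_t K,
                    const float *A, const float *B, const float *C,
                    float *Result) {
    for (size_t m = 0; m < M; m++)
        for (size_t n = 0; n < N; n++) {
            float acc = C[m * N + n];
            for (size_t k = 0; k < K; k++)
                acc += A[m * K + k] * B[k * N + n];
            Result[m * N + n] = acc;
        }
}
```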

VkCooperativeMatrixPropertiesKHR - Structure specifying cooperative matrix properties
VkCooperativeMatrixFlexibleDimensionsPropertiesNV - Structure specifying cooperative matrix properties
VkCooperativeMatrixPropertiesNV - Structure specifying cooperative matrix properties
VkScopeKHR - Specify SPIR-V scope
VkComponentTypeKHR - Specify SPIR-V cooperative matrix component type

Cooperative Vectors

A cooperative vector type is a SPIR-V vector type optimized for the evaluation of small neural networks.

SPIR-V defines the types and instructions, but does not specify rules about what combinations of types are valid, and it is expected that different implementations may support different combinations.

vkGetPhysicalDeviceCooperativeVectorPropertiesNV - Returns properties describing what cooperative vector types are supported
VkCooperativeVectorPropertiesNV - Structure specifying cooperative vector properties
vkConvertCooperativeVectorMatrixNV - Convert a cooperative vector matrix from one layout and type to another
VkConvertCooperativeVectorMatrixInfoNV - Structure specifying a request to convert the layout and type of a cooperative vector matrix
VkCooperativeVectorMatrixLayoutNV - Specify cooperative vector matrix layout
vkCmdConvertCooperativeVectorMatrixNV - Convert a cooperative vector matrix from one layout and type to another

Validation Cache

VkValidationCacheEXT - Opaque handle to a validation cache object
vkCreateValidationCacheEXT - Creates a new validation cache
VkValidationCacheCreateInfoEXT - Structure specifying parameters of a newly created validation cache
VkValidationCacheCreateFlagsEXT - Reserved for future use
vkMergeValidationCachesEXT - Combine the data stores of validation caches
vkGetValidationCacheDataEXT - Get the data store from a validation cache
VkValidationCacheHeaderVersionEXT - Encode validation cache version
vkDestroyValidationCacheEXT - Destroy a validation cache object

CUDA Modules

Creating a CUDA Module

VkCudaModuleNV - Opaque handle to a CUDA module object
vkCreateCudaModuleNV - Creates a new CUDA module object
VkCudaModuleCreateInfoNV - Structure specifying the parameters to create a CUDA module

Creating a CUDA Function Handle

VkCudaFunctionNV - Opaque handle to a CUDA function object
vkCreateCudaFunctionNV - Creates a new CUDA function object
VkCudaFunctionCreateInfoNV - Structure specifying the parameters to create a CUDA function

Destroying a CUDA Function

vkDestroyCudaFunctionNV - Destroy a CUDA function

Destroying a CUDA Module

vkDestroyCudaModuleNV - Destroy a CUDA module

Reading back CUDA Module Cache

After the PTX kernel code has been uploaded, the implementation compiles it into a binary cache containing all the information the device needs to execute it. This cache can be read back for later use, for example to accelerate the initialization of subsequent executions.

vkGetCudaModuleCacheNV - Get CUDA module cache

Limitations

CUDA and Vulkan do not use the device in the same configuration. The following limitations must be taken into account:

  • It is not possible to read or write global parameters from Vulkan. The only way to share resources or send values to the PTX kernel is to pass them as arguments of the function. See Resources sharing between CUDA Kernel and Vulkan for more details.
  • Calls to functions external to the module's PTX are not supported.
  • Vulkan disables some shader/kernel exceptions, which could break CUDA kernels that rely on exceptions.
  • CUDA kernels submitted to Vulkan are limited in the amount of shared memory they can use, which can be queried from the physical device capabilities; this may be less than what CUDA offers.
  • CUDA instruction-level preemption (CILP) does not work.
  • CUDA Unified Memory is not supported by this extension.
  • CUDA dynamic parallelism is not supported.
  • vk*DispatchIndirect is not available.

Shader Instrumentation

Shaders can be instrumented to provide a runtime shader cost analysis.

Shader instrumentation is enabled for a pipeline when VK_PIPELINE_CREATE_2_INSTRUMENT_SHADERS_BIT_ARM is included in VkPipelineCreateFlags2CreateInfoKHR::flags.

Shader instrumentation is enabled for a shader object when VK_SHADER_CREATE_INSTRUMENT_SHADER_BIT_ARM is included in VkShaderCreateInfoEXT::flags.

Shader instrumentation incurs a runtime performance cost. Applications and tools are expected to enable shader instrumentation only during development, for performance profiling or debugging purposes, and to leave it disabled in production use of the application.

Shader Instrumentation Metrics

vkEnumeratePhysicalDeviceShaderInstrumentationMetricsARM - Returns properties describing what shader instrumentation metrics are supported
VkShaderInstrumentationMetricDescriptionARM - Structure specifying shader instrumentation metric properties

Shader Instrumentation Objects

VkShaderInstrumentationARM - Opaque handle to a shader instrumentation object
vkCreateShaderInstrumentationARM - Create a new shader instrumentation object
VkShaderInstrumentationCreateInfoARM - Structure specifying parameters of a newly created shader instrumentation
vkDestroyShaderInstrumentationARM - Destroy a shader instrumentation object

Shader Instrumentation Capture

vkCmdBeginShaderInstrumentationARM - Begin shader instrumentation
vkCmdEndShaderInstrumentationARM - End shader instrumentation

Shader Instrumentation Retrieval

vkGetShaderInstrumentationValuesARM - Retrieve shader instrumentation data
VkShaderInstrumentationValuesFlagsARM - Reserved for future use
VkShaderInstrumentationMetricDataHeaderARM - Structure describing the header of a metric block

The resultIndex is the index captured during command buffer recording, and identifies the draw, dispatch, or ray tracing command that the metrics are captured for.

Metrics are returned in ascending order of resultIndex values. Metrics with the same value of resultIndex are returned in ascending order of resultSubIndex values. Metrics with the same value of resultIndex and resultSubIndex are grouped by the value of stages. Metrics with the same value of resultIndex, resultSubIndex, and stages are returned in ascending order of basicBlockIndex values.
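This ordering can be expressed as a comparison over the four fields, sketched here for illustration; the struct is a stand-in for the header fields named in the text, not a Vulkan type, and ordering within equal stages values is an assumption of this sketch (the text only requires grouping by stages):

```c
#include <stdint.h>

/* Illustrative key mirroring the fields the text orders by. */
typedef struct {
    uint32_t resultIndex;
    uint32_t resultSubIndex;
    uint32_t stages;
    uint32_t basicBlockIndex;
} MetricKey;

/* Compare by resultIndex, then resultSubIndex, then stages (to group
 * equal values together), then basicBlockIndex, all ascending. */
int metric_key_compare(const MetricKey *a, const MetricKey *b) {
    if (a->resultIndex != b->resultIndex)
        return a->resultIndex < b->resultIndex ? -1 : 1;
    if (a->resultSubIndex != b->resultSubIndex)
        return a->resultSubIndex < b->resultSubIndex ? -1 : 1;
    if (a->stages != b->stages)
        return a->stages < b->stages ? -1 : 1;
    if (a->basicBlockIndex != b->basicBlockIndex)
        return a->basicBlockIndex < b->basicBlockIndex ? -1 : 1;
    return 0;
}
```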

All metrics for commands that record multiple draws or dispatches (such as indirect drawing commands with drawCount greater than one), or that involve groups of shaders (such as ray tracing pipelines), are returned using the same resultIndex.

Implementations may use a non-zero resultSubIndex to report more fine-grained metrics (such as per draw) for such commands, or aggregate all metrics for the command using resultSubIndex zero.

Metrics for commands recorded while multiview is enabled are returned as aggregated values across all views.

Implementations may aggregate metrics for multiple shader stages. The value of stages describes which shader stages have been aggregated.

basicBlockIndex describes the index of the basic block of the shader that metrics are captured for. If VkPhysicalDeviceShaderInstrumentationPropertiesARM::perBasicBlockGranularity is VK_FALSE, results are aggregated for the entire shader and reported as basic block zero.

Shader Instrumentation Clearing

vkClearShaderInstrumentationMetricsARM - Clear shader instrumentation metrics to zero