VK_EXT_device_fault.proposal
This document outlines functionality to allow applications to query for additional diagnostic information following device-loss.
Problem Statement
Device-loss errors can be challenging to diagnose. They can be triggered by a number of issues, including invalid application behavior, driver bugs, and physical failure or removal of hardware. Whilst the Vulkan Validation layers are recommended as a first step in diagnosing the majority of API usage issues, they are unable to address all possible causes of device-loss.
This proposal aims to provide application developers with additional information that may aid in diagnosing such errors.
Solution Space
Several options have been considered:
- Provide foundational extensions to enable the development of crash postmortem tooling
- Develop extensions or tools that aim to attribute faults to individual Vulkan objects
- Rely on individual vendor tools and extensions
This proposal focuses on the first option. It represents a partial solution, with further extensions required in order to fully enable crash postmortem tooling.
Proposal
API Features
The following features are exposed by the VK_EXT_device_fault extension:
typedef struct VkPhysicalDeviceFaultFeaturesEXT {
VkStructureType sType;
void* pNext;
VkBool32 deviceFault;
VkBool32 deviceFaultVendorBinary;
} VkPhysicalDeviceFaultFeaturesEXT;
- deviceFault is the main feature enabling this extension’s functionality and must be supported if this extension is supported.
- deviceFaultVendorBinary is an optional feature that enables support for vendor-specific binary crash dumps, which may be interpreted via external vendor tools.
Querying for Fault Information
Following device-loss, applications may query for additional diagnostic information by calling vkGetDeviceFaultInfoEXT.
typedef struct VkDeviceFaultCountsEXT {
VkStructureType sType;
void* pNext;
uint32_t addressInfoCount;
uint32_t vendorInfoCount;
VkDeviceSize vendorBinarySize;
} VkDeviceFaultCountsEXT;
typedef struct VkDeviceFaultInfoEXT {
VkStructureType sType;
void* pNext;
char description[VK_MAX_DESCRIPTION_SIZE];
VkDeviceFaultAddressInfoEXT* pAddressInfos;
VkDeviceFaultVendorInfoEXT* pVendorInfos;
void* pVendorBinaryData;
} VkDeviceFaultInfoEXT;
VKAPI_ATTR VkResult VKAPI_CALL vkGetDeviceFaultInfoEXT(
VkDevice device,
VkDeviceFaultCountsEXT* pFaultCounts,
VkDeviceFaultInfoEXT* pFaultInfo);
The signature of vkGetDeviceFaultInfoEXT is intended to mirror the design of existing query functions, where the second parameter (pFaultCounts) indicates the size of the output arrays, or the number of results written. However, device fault information requires multiple output arrays. Therefore, a VkDeviceFaultCountsEXT structure is used to specify the sizes of multiple arrays at once.
// Query number of available results
VkDeviceFaultCountsEXT faultCounts{};
faultCounts.sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_COUNTS_EXT;
vkGetDeviceFaultInfoEXT(device, &faultCounts, NULL);
// Allocate output arrays and query fault data
VkDeviceFaultInfoEXT faultInfo{};
faultInfo.sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_EXT;
faultInfo.pAddressInfos = (VkDeviceFaultAddressInfoEXT*) malloc(sizeof(VkDeviceFaultAddressInfoEXT) *
    faultCounts.addressInfoCount);
faultInfo.pVendorInfos = (VkDeviceFaultVendorInfoEXT*) malloc(sizeof(VkDeviceFaultVendorInfoEXT) *
    faultCounts.vendorInfoCount);
faultInfo.pVendorBinaryData = malloc(faultCounts.vendorBinarySize);
vkGetDeviceFaultInfoEXT(device, &faultCounts, &faultInfo);
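The two-call pattern above can also be written with owning containers, so that the output arrays are released automatically. The following is a minimal sketch under stated assumptions, not a definitive implementation: AddressInfo, FaultCounts, FaultInfo and fakeGetDeviceFaultInfo are hypothetical stand-ins that mirror only the fields used here, so the sketch compiles without the Vulkan headers. Real code would use the Vk* types and call vkGetDeviceFaultInfoEXT instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the Vulkan structures (assumption, not the real API).
struct AddressInfo { uint32_t addressType; uint64_t reportedAddress; uint64_t addressPrecision; };
struct FaultCounts { uint32_t addressInfoCount = 0; uint64_t vendorBinarySize = 0; };
struct FaultInfo {
    AddressInfo* pAddressInfos = nullptr;
    void* pVendorBinaryData = nullptr;
};

// Hypothetical fake of the query. With info == nullptr it only writes the
// counts; otherwise it fills the arrays, never exceeding the capacities passed
// in through counts, and writes back how much was actually produced.
inline int fakeGetDeviceFaultInfo(FaultCounts* counts, FaultInfo* info) {
    const AddressInfo record{1u, 0x1000u, 4096u};
    const unsigned char blob[4] = {0xDE, 0xAD, 0xBE, 0xEF};
    if (info == nullptr) {
        counts->addressInfoCount = 1;
        counts->vendorBinarySize = sizeof(blob);
        return 0;
    }
    if (counts->addressInfoCount >= 1 && info->pAddressInfos != nullptr) {
        info->pAddressInfos[0] = record;
        counts->addressInfoCount = 1;
    } else {
        counts->addressInfoCount = 0;
    }
    if (counts->vendorBinarySize >= sizeof(blob) && info->pVendorBinaryData != nullptr) {
        std::memcpy(info->pVendorBinaryData, blob, sizeof(blob));
        counts->vendorBinarySize = sizeof(blob);
    } else {
        counts->vendorBinarySize = 0;
    }
    return 0;
}

struct FaultReport {
    std::vector<AddressInfo> addresses;
    std::vector<unsigned char> vendorBinary;
};

inline FaultReport queryFaultReport() {
    FaultCounts counts;
    fakeGetDeviceFaultInfo(&counts, nullptr);  // first call: sizes only

    FaultReport report;
    report.addresses.resize(counts.addressInfoCount);
    report.vendorBinary.resize(static_cast<std::size_t>(counts.vendorBinarySize));

    FaultInfo info;
    info.pAddressInfos = report.addresses.data();
    info.pVendorBinaryData = report.vendorBinary.data();
    fakeGetDeviceFaultInfo(&counts, &info);    // second call: fill the arrays

    // The counts now hold what was written; it must fit the capacity offered.
    assert(counts.addressInfoCount <= report.addresses.size());
    report.addresses.resize(counts.addressInfoCount);
    report.vendorBinary.resize(static_cast<std::size_t>(counts.vendorBinarySize));
    return report;
}
```

Resizing after the second call keeps the containers consistent with the number of results actually written, which may be smaller than the sizes reported by the first call.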
Interpreting GPU Virtual Addresses
Implementations may return information on both page faults generated by invalid memory accesses, and instruction pointers indicating the instructions executing at the time of the fault.
typedef enum VkDeviceFaultAddressTypeEXT {
VK_DEVICE_FAULT_ADDRESS_TYPE_NONE_EXT = 0,
VK_DEVICE_FAULT_ADDRESS_TYPE_READ_INVALID_EXT = 1,
VK_DEVICE_FAULT_ADDRESS_TYPE_WRITE_INVALID_EXT = 2,
VK_DEVICE_FAULT_ADDRESS_TYPE_EXECUTE_INVALID_EXT = 3,
VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_UNKNOWN_EXT = 4,
VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_INVALID_EXT = 5,
VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_FAULT_EXT = 6,
VK_DEVICE_FAULT_ADDRESS_TYPE_MAX_ENUM_EXT = 0x7FFFFFFF
} VkDeviceFaultAddressTypeEXT;
typedef struct VkDeviceFaultAddressInfoEXT {
VkDeviceFaultAddressTypeEXT addressType;
VkDeviceAddress reportedAddress;
VkDeviceSize addressPrecision;
} VkDeviceFaultAddressInfoEXT;
Page addresses and instruction pointers are reported as GPU virtual addresses, and additional extensions or vendor tools may be required in order to correlate these addresses with individual Vulkan objects.
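Judging by their names, the address types listed above fall into two groups: invalid memory accesses (page faults) and instruction pointers. A tool that prints fault records might first sort each record into one of these groups. The sketch below is illustrative only; FaultAddressType is a stand-in that mirrors the enumerant values listed above, and FaultAddressKind and classify are hypothetical names that are not part of the extension.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkDeviceFaultAddressTypeEXT, using the values listed above.
enum class FaultAddressType : uint32_t {
    None = 0,
    ReadInvalid = 1,
    WriteInvalid = 2,
    ExecuteInvalid = 3,
    InstructionPointerUnknown = 4,
    InstructionPointerInvalid = 5,
    InstructionPointerFault = 6,
};

// Hypothetical grouping used only by this sketch.
enum class FaultAddressKind { None, PageFault, InstructionPointer, Unrecognized };

inline FaultAddressKind classify(FaultAddressType type) {
    switch (type) {
        case FaultAddressType::None:
            return FaultAddressKind::None;
        case FaultAddressType::ReadInvalid:
        case FaultAddressType::WriteInvalid:
        case FaultAddressType::ExecuteInvalid:
            return FaultAddressKind::PageFault;
        case FaultAddressType::InstructionPointerUnknown:
        case FaultAddressType::InstructionPointerInvalid:
        case FaultAddressType::InstructionPointerFault:
            return FaultAddressKind::InstructionPointer;
    }
    // Values added by later extensions end up here rather than being misreported.
    return FaultAddressKind::Unrecognized;
}
```

Returning a separate Unrecognized value keeps the tool from mislabeling address types that a newer driver may report.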
Implementations may only be able to report these addresses with limited precision. The combination of reportedAddress and addressPrecision allows the possible range of addresses to be calculated, such that:
lower_address = (pInfo->reportedAddress & ~(pInfo->addressPrecision-1))
upper_address = (pInfo->reportedAddress | (pInfo->addressPrecision-1))
It is valid for the reportedAddress to contain a more precise address than indicated by addressPrecision. In this case, the value of reportedAddress should be treated as an additional hint as to the value of the address that triggered the page fault, or to the value of an instruction pointer.
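As a worked example of the range calculation, a reportedAddress of 0x12345 with an addressPrecision of 0x1000 (4 KiB) yields the inclusive range 0x12000 to 0x12FFF. The sketch below wraps the two expressions in a helper. AddressRange and faultAddressRange are hypothetical names, plain 64-bit integers stand in for VkDeviceAddress and VkDeviceSize, and the power-of-two check is an assumption implied by the mask arithmetic rather than something stated in this document.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive range of addresses that may contain the faulting address.
struct AddressRange {
    uint64_t lower;
    uint64_t upper;
};

inline AddressRange faultAddressRange(uint64_t reportedAddress, uint64_t addressPrecision) {
    // The masks below only describe a contiguous range when the precision
    // is a non-zero power of two (assumption of this sketch).
    assert(addressPrecision != 0 && (addressPrecision & (addressPrecision - 1)) == 0);
    const uint64_t mask = addressPrecision - 1;
    return AddressRange{reportedAddress & ~mask, reportedAddress | mask};
}
```

With an addressPrecision of 1, the mask is zero and both bounds equal reportedAddress, which corresponds to an exactly reported address.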
Vendor Binary Crash Dumps
Optionally, implementations may also support the generation of vendor-specific binary blobs containing additional diagnostic information. All vendor-specific binaries will begin with a common header. The contents of the remainder of the binary blob are vendor-specific, and will require vendor-specific documentation or tools to interpret.
typedef struct VkDeviceFaultVendorBinaryHeaderVersionOneEXT {
uint32_t headerSize;
VkDeviceFaultVendorBinaryHeaderVersionEXT headerVersion;
uint32_t vendorID;
uint32_t deviceID;
uint32_t driverVersion;
uint8_t pipelineCacheUUID[VK_UUID_SIZE];
uint32_t applicationNameOffset;
uint32_t applicationVersion;
uint32_t engineNameOffset;
} VkDeviceFaultVendorBinaryHeaderVersionOneEXT;
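Before handing a binary blob to a vendor tool, an application may want to sanity-check the common header. The sketch below reads only the first two fields, headerSize and headerVersion, and makes several assumptions that should be checked against the specification: that the header version enum occupies 32 bits, that version one is identified by the value 1, and that the blob is in host byte order. HeaderPrefix, kHeaderVersionOne and readHeaderPrefix are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: header version one is identified by the value 1.
constexpr uint32_t kHeaderVersionOne = 1;

// The first two fields of the common header, as listed above.
struct HeaderPrefix {
    uint32_t headerSize;
    uint32_t headerVersion;
};

// Returns true and fills *out only if the blob is large enough to hold the
// prefix, the declared header size fits inside the blob, and the version is
// one this sketch understands.
inline bool readHeaderPrefix(const void* blob, std::size_t blobSize, HeaderPrefix* out) {
    if (blob == nullptr || out == nullptr || blobSize < sizeof(HeaderPrefix)) {
        return false;
    }
    HeaderPrefix prefix;
    std::memcpy(&prefix, blob, sizeof(prefix));  // copy to avoid unaligned reads
    if (prefix.headerSize < sizeof(HeaderPrefix) || prefix.headerSize > blobSize) {
        return false;
    }
    if (prefix.headerVersion != kHeaderVersionOne) {
        return false;
    }
    *out = prefix;
    return true;
}
```

Checking headerSize against the blob size before reading any further fields guards against truncated or corrupted dumps.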
Issues
1) Should vkGetDeviceFaultInfoEXT return multiple faults?
RESOLVED: No. This extension only seeks to identify a single fault as a possible cause of device loss and not to maintain a log of multiple faults. We anticipate that in cases where a GPU does encounter multiple faults, there is a high probability that the faults would be duplicates, such as those caused by parallel execution of the same defective code.
2) Can vkGetDeviceFaultInfoEXT be called prior to device loss?
RESOLVED: No. VK_KHR_fault_handling in Vulkan SC does support an equivalent to this, but VK_KHR_fault_handling aims to address a different use case, where a fault log is polled prior to device loss to enable remedial action to be taken.
3) Do page faults need to report the actual address that was accessed, or should we allow reporting of the page address?
RESOLVED: Some IHVs' hardware reports page faults at page alignment, or at some other hardware-unit dependent granularity, rather than the precise address that triggered the fault. All addresses are reported at hardware-unit dependent granularity, along with an associated precision indicator. This information can be used to compute an address range that contains the original address that triggered the fault.
4) How should we report cases where one of multiple pipelines may have caused a fault?
RESOLVED: In cases where a fault cannot be attributed to a single unique pipeline, reporting the set of possible candidates is desirable.
5) The page fault and instruction address information structures have similar structure. Should they be combined?
RESOLVED: Yes. These have been combined as VkDeviceFaultAddressInfoEXT to reduce API surface area.
6) How should implementors approach extensibility for vendor-specific faults? Should they rely on pNext chains, or should the extension introduce a generic structure to return vendor error codes and human-readable descriptions in the base structure?
RESOLVED: Implementors should utilize the generic VkDeviceFaultVendorInfoEXT structures where applicable, and fall back to extending pNext chains where this is insufficient. Where a pNext chain is required, vendors should tailor their human-readable error descriptions to advise developers that additional information may be available.