VK_KHR_device_fault.proposal

This document outlines functionality to allow applications to query for diagnostic information about device faults following a VK_ERROR_DEVICE_LOST. This mirrors functionality first exposed via VK_EXT_device_fault.

It also extends this functionality to allow drivers to report faults which may normally be handled internally and not reported as device lost errors.

Additionally, it allows for reporting faults which are provoked asynchronously by a submitted command buffer, but were not triggered until after the queue submit had returned.

Problem Statement

A number of issues, including invalid application behavior, driver bugs, hardware bugs, and physical failures may cause the device to fault. When this happens, submitted work may not run to completion in which case any outputs will have undefined contents.

Such faults may or may not be reported as device lost errors. Specifically, an implementation can choose to not report device lost if it expects that work submitted in the future will run successfully, i.e., if the fault is recoverable and does not cause the device to become unable to process future submits ('lost').

Device lost errors are very disruptive because they are 'sticky' and require that the logical device is recreated before the error state is cleared. In practice, many applications are not written to gracefully recover from these errors. For these reasons, 'hiding' any recoverable faults occur can be a pragmatic choice.

The VK_EXT_device_fault extension was introduced to help diagnose device lost errors, which can be challenging to resolve. But this extension cannot be used to diagnose hidden recoverable faults because it requires that the device is already in the lost state.

This proposal aims to encompass the functionality provided by VK_EXT_device_fault whilst widening the scope to provide application developers with a method to receive notifications and information about fault events irrespective of whether or not they resulted in a device lost error.

Such information could be used by a developer to diagnose visual artifacts, performance issues or as input to telemetry.

Solution Space

Several options have been considered.

Allow vkGetDeviceFaultInfoEXT/KHR() to be called ad-hoc (Polling)

An alternative could be to just promote vkGetDeviceFaultInfoEXT to KHR & allow vkGetDeviceFaultInfoKHR to be called before the device is in the 'lost' state - however, that does not support any form of blocking query, so would incur more CPU overhead for polling to be effective, and would not differentiate device lost & masked faults.

Introduce a new error code

This option would introduce a new error code (e.g., "VK_ERROR_RECOVERABLE_FAULT_KHR") that could be returned by the same API commands that can return VK_ERROR_DEVICE_LOST.

The error would not put the logical device in a permanent 'lost' state, but would either:

  • be reported as a one-time event (like other errors)
  • be reported until explicitly acknowledged and cleared (e.g., by calling a new command like vkDeviceRecoverKHR)

The downside to this option is that many applications are not written to handle any runtime errors, which means that in practice any such error would result in application termination.

Introduce an extended reporting mechanism: Polling With Timeout

This option adds to the capability of the previous "Polling" option to allow blocking waits. This would allow applications to dispatch a low CPU usage fault catcher thread, whilst still having a firm entry/exit point into the driver.

Introduce an extended reporting mechanism: Callbacks

This option adds a callback to return information on faults.

This information provided by the callback mechanism is the same as for the polling approach.

The benefits of the callback mechanism would be that the application would be notified immediately when events occur, rather than manually polling.

Disadvantage would be in complexity of threading behavior - to ensure prompt callback delivery, it implies a driver-side thread and providing guarantees as to application thread calling state at the point of dispatch would be difficult.

Proposal

From the solution space previously detailed, the option _introduce_an_extended_reporting_mechanism_polling_with_timeout was selected as delivering many of the asynchronous delivery advantages of callbacks whilst still providing a deterministic entry point to the driver with guaranteed delivery thread context, whilst being less intrusive/problematic in terms of required application-side support than adding additional error codes.

API Features

Querying for Fault Information

Applications can query for the presence of fault reports and corresponding diagnostic information at any time by calling vkGetDeviceFaultReportsKHR.

vkGetDeviceFaultReportsKHR() replaces the old vkGetDeviceFaultInfoEXT() API and aims to unify fault reporting in a more extensible fashion, whilst allowing for blocking waits via a timeout parameter.

vkGetDeviceFaultReportsKHR() differs from vkGetDeviceFaultInfoEXT() in four significant ways:

  1. It may be called at any time, without the requirement for a VK_ERROR_DEVICE_LOST condition to exist prior to the call.
  2. It may report faults that did not result in a VK_ERROR_DEVICE_LOST condition (if enabled via the deviceFaultReportMasked feature).
  3. It provides support for blocking queries.
  4. It is not the retrieval API for vendor binary dumps.
// Retrieve fault entries
VKAPI_ATTR VkResult VKAPI_CALL vkGetDeviceFaultReportsKHR(
    VkDevice                                    device,
    uint64_t                                    timeout,
    uint32_t*                                   pFaultCount,
    VkDeviceFaultReportKHR*                     pFaultInfos);

The signature of vkGetDeviceFaultReportsKHR follows the convention of existing query functions (input parameters first), with the third parameter pFaultCount providing the size of output array in the subsequent parameter pFaultInfos (on input to fault retrieval), returning the number of results written to pFaultInfos (on output from fault retrieval), or is populated with the number of available results (on sizing query).

Faults are returned in order of occurrence.

pFaultCount must not be NULL.

pFaultInfos points to an array of size *pFaultCount entries to return, or NULL. If pFaultInfos==NULL, then the number of available results is returned in *pFaultCount. The fault entries are returned in order of occurrence.

timeout is the timeout period in units of nanoseconds. timeout is adjusted to the closest value allowed by the implementation-dependent timeout accuracy, which may be substantially longer than one nanosecond, and may be longer than the requested period. If a zero timeout is passed then the function returns immediately irrespective of whether any faults are available.

The entries returned by vkGetDeviceFaultReportsKHR() take the form shown below, with populated fault reporting fields indicated by the flags field.

Each individual fault report is returned exactly once.

vkGetDeviceFaultReportsKHR() can be invoked in parallel from different threads, in which case each invocation for a given device will return a unique set of reports, with no fault report being returned to more than one invocation.

typedef struct {
    VkStructureType             sType;
    void*                       pNext;
    VkDeviceFaultFlagsKHR       flags;              // indicates masking/device-loss/timeout status + which members are populated
    uint64_t                    groupID;            // unique groupID for grouping multiple faults
    char                        description[VK_MAX_DESCRIPTION_SIZE];
    VkDeviceFaultAddressInfoKHR faultAddressInfo;       // flags & VK_DEVICE_FAULT_FLAG_MEMORY_ADDRESS_KHR
    VkDeviceFaultAddressInfoKHR instructionAddressInfo; // flags & VK_DEVICE_FAULT_FLAG_INSTRUCTION_ADDRESS_KHR
    VkDeviceFaultVendorInfoKHR  vendorInfo;             // flags & VK_DEVICE_FAULT_FLAG_VENDOR_KHR
} VkDeviceFaultInfoKHR;

The flags field is a bitmask of the following values:

typedef enum {
    VK_DEVICE_FAULT_FLAG_DEVICE_LOST_KHR         = 1,
    VK_DEVICE_FAULT_FLAG_MEMORY_ADDRESS_KHR      = 2,
    VK_DEVICE_FAULT_FLAG_INSTRUCTION_ADDRESS_KHR = 4,
    VK_DEVICE_FAULT_FLAG_VENDOR_KHR              = 8,
    VK_DEVICE_FAULT_FLAG_WATCHDOG_TIMEOUT_KHR    = 16,
    VK_DEVICE_FAULT_FLAG_OVERFLOW_KHR            = 32,
} VkDeviceFaultFlagsKHR;

VK_DEVICE_FAULT_FLAG_DEVICE_LOST_KHR is a special flag, indicating that the reported fault triggered VK_ERROR_DEVICE_LOST and that no subsequent faults will be returned. If several VkDeviceFaultInfoKHR records are generated by a single fault which triggers VK_ERROR_DEVICE_LOST, they should be grouped with a single groupID and the last entry in the group marked with VK_DEVICE_FAULT_FLAG_DEVICE_LOST_KHR. This flag is also set for faults which have been made fatal via the deviceFaultDeviceLostOnMasked feature flag.

VK_DEVICE_FAULT_FLAG_OVERFLOW_KHR is a special flag, indicating that an internal fault log buffer overflow has occurred in the driver. For example, if it implemented the fault record as a ring buffer, it has reached capacity up and is utilizing an LRU scheme so is overwriting older fault records. This flag is only set on the first fault entry read following missed faults due to an overflow.

VK_DEVICE_FAULT_FLAG_MEMORY_ADDRESS_KHR indicates that the VkDeviceFaultInfoKHR faultAddressInfo field has been populated.

VK_DEVICE_FAULT_FLAG_INSTRUCTION_ADDRESS_KHR indicates that the VkDeviceFaultInfoKHR instructionAddressInfo field has been populated.

VK_DEVICE_FAULT_FLAG_VENDOR_KHR indicates that the VkDeviceFaultInfoKHR vendorInfo field has been populated.

VK_DEVICE_FAULT_FLAG_WATCHDOG_TIMEOUT_KHR indicates that a GPU timeout has occurred (further information may be supplied via platform specific extensions to the VkDeviceFaultInfoKHR structure’s pNext chain).

A groupID field is included to allow association of multiple faults to a single event (eg. where multiple page faults are triggered from a single event), and should be monotonically incrementing. Where an implementation is unable to group events, the groupID should increment for every reported event.

The VkDeviceFaultVendorInfoKHR structure is a direct promotion/alias of the existing VkDeviceFaultVendorInfoEXT structure:

typedef struct VkDeviceFaultVendorInfoKHR {
    char        description[VK_MAX_DESCRIPTION_SIZE];
    uint64_t    vendorFaultCode;
    uint64_t    vendorFaultData;
} VkDeviceFaultVendorInfoKHR;

description must be a null-terminated UTF-8 string, and may provide a human readable description of the fault.

The exact meaning/values of the vendorFaultCode and vendorFaultData fields are vendor-defined.

Return values

  • VK_SUCCESS is returned if the query completed within the specified timeout period and at least one fault information was returned.
  • VK_TIMEOUT is returned if no fault information is available within the specified timeout period, even in the case that timeout was zero and no wait was actually performed.
  • VK_INCOMPLETE is returned if more fault reports are available than space given in the pFaultCount parameter.

Interpreting GPU Virtual Addresses

Implementations may return information on both page faults generated by invalid memory accesses, and instruction pointers indicating the instructions executing at the time of the fault.

typedef enum VkDeviceFaultAddressTypeKHR {
    VK_DEVICE_FAULT_ADDRESS_TYPE_NONE_KHR = 0,
    VK_DEVICE_FAULT_ADDRESS_TYPE_READ_INVALID_KHR = 1,
    VK_DEVICE_FAULT_ADDRESS_TYPE_WRITE_INVALID_KHR = 2,
    VK_DEVICE_FAULT_ADDRESS_TYPE_EXECUTE_INVALID_KHR = 3,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_UNKNOWN_KHR = 4,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_INVALID_KHR = 5,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_FAULT_KHR = 6,
    VK_DEVICE_FAULT_ADDRESS_TYPE_MAX_ENUM_KHR = 0x7FFFFFFF
} VkDeviceFaultAddressTypeKHR;

typedef struct VkDeviceFaultAddressInfoKHR {
    VkDeviceFaultAddressTypeKHR    addressType;
    VkDeviceAddress                reportedAddress;
    VkDeviceSize                   addressPrecision;
} VkDeviceFaultAddressInfoKHR;

Page addresses and instruction pointers are reported as GPU virtual addresses, and additional extensions or vendor tools may be required in order to correlate these extensions with individual Vulkan objects.

Implementations may only be able to report these addresses with limited precision. The combination of reportedAddress and addressPrecision allow the possible range of addresses to be calculated, such that:

lower_address = (pInfo->reportedAddress & ~(pInfo->addressPrecision-1))
upper_address = (pInfo->reportedAddress |  (pInfo->addressPrecision-1))

Retrieving the Vendor Binaries and other Fault Debug information

Making the vendor binary retrieval a distinct API allows us to restrict its usage to only situations where VK_ERROR_DEVICE_LOST has been returned.

vkGetDeviceFaultDebugInfoKHR() returns a single vendor binary which reflects the state of the device when device loss occurred. Vendor binary availability does not persist beyond device destruction.

Once a vendor binary has been retrieved, repeated calls to vkGetDeviceFaultDebugInfoKHR() will return the same vendor binary.

// Retrieve Extended Fault Info (Vendor Binary Dump, etc)
VKAPI_ATTR VkResult VKAPI_CALL vkGetDeviceFaultDebugInfoKHR(
    VkDevice                                    device,
    VkDeviceFaultDebugInfoKHR                   *pDebugInfo);

Where device is a device which must have returned VK_ERROR_DEVICE_LOST and pDebugInfo must be a pointer to a VkDeviceFaultDebugInfoKHR structure.

In cases where the application will destroy/recreate the device, it is the responsibility of the application code to ensure that the device is not destroyed prior to calling vkGetDeviceFaultDebugInfoKHR().

typedef struct VkDeviceFaultDebugInfoKHR {
    VkStructureType                 sType;
    void*                           pNext;              // Can chain VkDeviceFaultShaderAbortMessageCountsKHR and/or VkDeviceFaultShaderAbortMessageInfoKHR on this
    uint32_t                        vendorBinarySize;
    void*                           pVendorBinary;      // If vendorBinarySize is non-zero, pVendorBinary must not be NULL and must point to a buffer of size vendorBinarySize.
} VkDeviceFaultDebugInfoKHR;

The vendor binary is retrieved via a VkDeviceFaultDebugInfoKHR structure, which may be extended via the pNext chain to retrieve further information (for example, via the shader abort extension).

The VkDeviceFaultDebugInfoKHR structure follows the convention of existing query functions, where the vendorBinarySize field indicates size of output array (pVendorBinary) in bytes, the number of bytes written, or is populated by the driver with the number of bytes required to retrieve the vendor binary blob.

pVendorBinary points to a buffer of size vendorBinarySize bytes, or NULL for a sizing query in which case the vendorBinarySize field is populated with the number of bytes required for retrieval.

VkDeviceFaultDebugInfoKHR may be extended to retrieve further information relating to device loss, for example using a VkDeviceFaultShaderAbortMessageInfoKHR structure where VK_KHR_shader_abort is in use. state debug information via the pNext chain (for example, by VK_KHR_shader_abort)

Return values

  • VK_SUCCESS is returned if a valid vendor binary has been returned in *pVendorBinaryData or if pVendorBinaryData is null and the size of the vendor binary has been returned in *pVendorBinarySize.
  • VK_ERROR_NOT_ENOUGH_SPACE_KHR is returned if not enough space was provided for the vendor binary to be returned.

VK_KHR_device_fault Feature Flags

The following features are exposed by the VK_KHR_device_fault extension:

typedef struct VkPhysicalDeviceFaultFeaturesKHR {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           deviceFault;
    VkBool32           deviceFaultVendorBinary;
    VKBool32           deviceFaultReportMasked;
    VKBool32           deviceFaultDeviceLostOnMasked;
} VkPhysicalDeviceFaultFeaturesKHR;

deviceFault is the main feature enabling this extension’s functionality and must be supported if this extension is supported.

deviceFaultVendorBinary is an optional feature that enables support for vendor-specific binary crash dumps, which may be interpreted via external vendor tools. These are only generated after device-loss.

deviceFaultReportMasked is an optional feature which enables faults that would normally be masked by the implementation (ie. automatically recovered by the driver internally without the application receiving a VK_ERROR_DEVICE_LOST error) to be reported via this extension even if they did not result in a VK_ERROR_DEVICE_LOST condition being returned to the application.

deviceFaultDeviceLostOnMasked is an optional feature that if supported & enabled, causes the driver to return VK_ERROR_DEVICE_LOST for faults which would otherwise be masked by the implementation.

Limits

typedef struct VkPhysicalDeviceFaultPropertiesKHR {
    VkStructureType    sType;
    void*              pNext;
    uint32_t           maxDeviceFaultCount;
} VkPhysicalDeviceFaultPropertiesKHR;

Implementations are expected to retain fault reports in a fixed size buffer.

maxDeviceFaultCount is the maximum number of faults for which an implementations is required to retain information. If the number of faults generated exceeds this limit, then the oldest records will be overwritten.

Querying for faults via vkGetDeviceFaultReportsKHR() will drain records from the fault buffer, freeing space for new records.

maxDeviceFaultCount must be greater than or equal to 1.

Examples

Enabling Extension

  VkPhysicalDeviceFaultFeaturesKHR deviceDeviceFaultFeatures = {};
  deviceDeviceFaultFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FAULT_FEATURES_KHR;
  deviceDeviceFaultFeatures.deviceFault = VK_TRUE;
  deviceDeviceFaultFeatures.deviceFaultReportMasked = VK_TRUE;
  //...

Polling for Faults

// Query number of available results
uint32_t faultCounts;
if ((vkGetDeviceFaultReportsKHR(device, 0, &faultCounts, NULL) == VK_SUCCESS)
    && (faultCounts)) {
    // Allocate output arrays and query fault data
    VkDeviceFaultInfoKHR *pFaultInfo;
    pFaultInfo = (VkDeviceFaultInfoKHR*)calloc(faultCounts, sizeof(VkDeviceFaultInfoKHR));
    for(int n = 0; n < faultCounts; n++) {
        pFaultInfo[n].sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_KHR;
    }
    vkGetDeviceFaultReportsKHR(device, 0, &faultCounts, pFaultInfo);
    if (((r == VK_SUCCESS) || (r == VK_INCOMPLETE)) && (faultCount)) {
        // a fault is returned, do something with it
    } else if (r == VK_TIMEOUT) {
        // not an error, but a chance to exit if this is being run on a thread
    } else {
        // do something about the error return?
    }
    free(pFaultInfo);
}

Blocking to wait for faults

The application may choose to implement a scheme (likely on a separate thread) which uses a blocking query to wait for fault information to become available.

// Query number of available results
uint32_t faultCount;

while(true) {
    // Blocking query for fault counts
    if ((vkGetDeviceFaultReportsKHR(device, 1000, &faultCounts, NULL) == VK_SUCCESS)
        && (faultCounts)) {
        // Allocate output arrays and query fault data
        VkDeviceFaultInfoKHR *pFaultInfo;
        pFaultInfo = (VkDeviceFaultInfoKHR*)calloc(faultCounts, sizeof(VkDeviceFaultInfoKHR));
        for(int n = 0; n < faultCounts; n++) {
            pFaultInfo[n].sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_KHR;
        }
        // Non-blocking query to the number of read faults back, we already know how many
        VkResult r = vkGetDeviceFaultReportsKHR(device, 0, &faultCounts, pFaultInfo);
        if (((r == VK_SUCCESS) || (r == VK_INCOMPLETE)) && (faultCount)) {
            // a fault is returned, do something with it
        } else if (r == VK_TIMEOUT) {
            // not an error, but a chance to exit if this is being run on a thread
        } else {
            // do something about the error return?
        }
        free(pFaultInfo);
    }
}

Alternative (single query, no dynamic allocation):

// Query number of available results
uint32_t faultCount;
VkDeviceFaultInfoKHR faultInfo{}
faultInfo.sType             = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_KHR;
while(true) {
    // Blocking query for single fault address
    faultCount = 1;
    VkResult r = vkGetDeviceFaultReportsKHR(device, 1000, &faultCount, &faultInfo);
    if (((r == VK_SUCCESS) || (r == VK_INCOMPLETE)) && (faultCount)) {
        // a fault is returned, do something with it
    } else if (r == VK_TIMEOUT) {
        // not an error, but a chance to exit if this is being run on a thread
    } else {
        // do something about the error return?
    }
}

Retrieving Vendor Binary in response to Device Lost:

// Query number of available results
uint32_t faultCount;
VkDeviceFaultInfoKHR faultInfo{}
faultInfo.sType             = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_KHR;
while(true) {
    // Blocking query for single fault address
    faultCount = 1;
    VkResult r = vkGetDeviceFaultReportsKHR(device, 1000, &faultCount, &faultInfo);
    if (((r == VK_SUCCESS) || (r == VK_INCOMPLETE)) && (faultCount)) {
        // a fault is returned, do something with it
        if (faultInfo.flags & VK_DEVICE_FAULT_FLAG_DEVICE_LOST_KHR) {
            // This is a device lost fault...
            VkDeviceFaultDebugInfoKHR debugInfo = {
                .sType            = VK_STRUCTURE_TYPE_DEVICE_FAULT_DEBUG_INFO_KHR
                .pNext            = NULL,
                .vendorBinarySize = 0,
                .pVendorBinary    = NULL
            };
            // Sizing query
            if ((vkGetDeviceFaultDebugInfoKHR(device, &debugInfo) == VK_SUCCESS) && (debugInfo.vendorBinarySize)) {
                debugInfo.pVendorBinary = malloc(debugInfo.vendorBinarySize);
                // Vendor Binary Retrieval
                vkGetDeviceFaultDebugInfoKHR(device, &debugInfo);
            }
        }
    } else if (r == VK_TIMEOUT) {
        // not an error, but a chance to exit if this is being run on a thread
    } else {
        // do something about the error return - possibly nothing at all?
    }
}

Retrieving Shader Abort Messages:

// Query number of available results
uint32_t faultCount;
VkDeviceFaultInfoKHR faultInfo{}
faultInfo.sType             = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_KHR;
while(true) {
    // Blocking query for single fault address
    faultCount = 1;
    VkResult r = vkGetDeviceFaultReportsKHR(device, 1000, &faultCount, &faultInfo);
    if (((r == VK_SUCCESS) || (r == VK_INCOMPLETE)) && (faultCount)) {
        // a fault is returned, do something with it
        if (faultInfo.flags & VK_DEVICE_FAULT_FLAG_DEVICE_LOST_KHR) {
            // This is a device lost fault... check for shader abort messages
            VkDeviceFaultShaderAbortMessageInfoKHR abortMessageInfo = {
                .sType           = VK_STRUCTURE_TYPE_DEVICE_FAULT_SHADER_ABORT_MESSAGE_INFO_KHR,
                .pNext           = NULL,
                .messageDataSize = 0,
                .messageData     = NULL
            };
            VkDeviceFaultDebugInfoKHR debugInfo = {
                .sType           = VK_STRUCTURE_TYPE_DEVICE_FAULT_DEBUG_INFO_KHR
                .pNext           = &abortMessageInfo, // VkDeviceFaultShaderAbortMessageInfoKHR extends VkDeviceFaultDebugInfoKHR
                .vendorBinarySize = 0,
                .pVendorBinary    = NULL
            };
            if ((vkGetDeviceFaultDebugInfoKHR(device, &debugInfo) == VK_SUCCESS) && (abortMessageInfo.messageDataSize)) {
                // There is a shader abort message payload available - allocate space & retrieve it
                abortMessageInfo.messageData  = malloc(abortMessageInfo.messageDataSize);
                vkGetDeviceFaultDebugInfoKHR(device, &debugInfo);

                // Process each shader abort message
                // NOTE: in this example, up to 20 parameters are supported - this is NOT an API restriction.
                char formattedOutput[1024]; // Formatted message output buffer
                size_t offset = 0;          // Offset in message data
                int n = 0;                  // Number of parameters found
                char *param[20];            // Extracted parameter array
                uint32_t paramSize[20];     // Extracted parameter sizes
                for(n = 0; (n < 20) && (offset < abortMessageInfo.messageDataSize); n++)
                {
                    paramSize[n] = *(uint32_t*)&abortMessageInfo.messageData[offset];
                    offset += sizeof(uint32_t);
                    param[n] = &abortMessageInfo.messageData[offset];
                    offset += paramSize[n];
                }

                myCustomPrint(param, paramSize, formattedOutput);

                free(abortMessageInfo.messageData);
            }
        }
    } else if (r == VK_TIMEOUT) {
        // not an error, but a chance to exit if this is being run on a thread
    } else {
        // do something about the error return - possibly nothing at all?
    }
}

Issues

Should the reporting mechanism be based on polling, notifications, or both?

RESOLVED: The VulkanSC style notification callbacks are removed from this proposal.

What should happen if there is a mismatch between queried infoCount and available infoCount?

What happens if you mismatch these counts compared to what is actually available (particularly for vendor binaries).

RESOLVED

vkGetDeviceFaultReportsKHR() should return VK_INCOMPLETE as long as any array (fault addresses or vendor binaries) is not fully drained.

What thread should the callback be called from?

PROPOSED: The faults will likely be detected asynchronously, we therefore allow them to be reported whenever they are detected, and that may happen on background threads.

RESOLVED: Callbacks removed from scope.

Can we reuse existing extensions and mechanisms more directly?

This proposal builds on VK_EXT_device_fault. Therefore, an alternative could be to just allow vkGetDeviceFaultInfoEXT to be called before the device is in the 'lost' state - however, that does not support any form of blocking query, so would incur more CPU overhead for polling to be effective.

Another option was to mirror the Fault Handling mechanism used in Vulkan SC but report a different set of data. However, in common with the modified VK_EXT_device_fault approach, this would also preclude a blocking query and has the disadvantage of bringing callbacks back into scope.

RESOLVED

Blocking Queries

The timeout parameter added is intended specifically to allow for non-polling fault-monitor thread implementations.

RESOLVED

Callbacks

Previous versions of this proposal included callbacks with the statement "The callback may be called from multiple threads simultaneously, including from a background thread other than the thread calling the Vulkan commands". This sounded error prone, so for clarity, if callbacks were to be supported they should be serialized by the UMD, with each fault being reported once only via the callback mechanism, rather than allowing for multiple threads to be simultaneously reporting possibly overlapping sections of a the fault logs.

RESOLVED

Behavior of parallel calls to vkGetDeviceFaultReportsKHR()

How should multiple calls to vkGetDeviceFaultReportsKHR() be handled in parallel?

Fault Log Ring Limits

Does the driver need to expose the upper limit of recordable non-device lost faults?

Clarify behavior when app does not query faults rapidly enough and the ring log overflows those limits. Possibilities include:

  • "drop subsequent faults, report first encountered"
  • "LRU ring eviction, drop the oldest fault, always report most recent".

RESOLVED LRU eviction is required to ensure that a fatal (device lost) error is not dropped due to ring overflow.

Do we need a properties struct to indicate the maximum number of VkDeviceFaultAddressInfoKHR and VkDeviceFaultVendorInfoKHR structures?

Yes. Added maxDeviceFaultCount in VkPhysicalDeviceFaultPropertiesKHR structure.

RESOLVED

Do we need a way to communicate overflow on VkDeviceFaultAddressInfoKHR and VkDeviceFaultVendorInfoKHR ring buffers?

Proposal: flag on first returned info following data loss.

RESOLVED: Added VK_DEVICE_FAULT_FLAG_OVERFLOW_KHR

Handling of vendorBinary for non-deviceLost queries

How should this be treated?

Proposal: No vendor binaries should be returned unless device lost has been reported.

RESOLVED

Should we update the return code for incomplete vendor binaries?

Propose yes - Requesting a vendor binary (i.e. passing a non-zero vendor binary size) and then providing insufficient storage should return VK_ERROR_NOT_ENOUGH_SPACE_KHR. This should occur before any address/vendor info structs are returned (i.e. the fault buffers should not be drained on error)

RESOLVED in separated vendor binary retrieval API.

Further Functionality

Additional functionality that could be considered:

  • adding a 'treat faults as errors' option to require that reported fault result in device lost.
  • adding parameters to control how many faults the implementation reports or tracks.

Revisions

1.0 Initial version