Bug 109565 - CmdBindDescriptorSets gets confused about dynamic offsets
Status: RESOLVED WORKSFORME
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2019-02-06 12:09 UTC by Jakub Okoński
Modified: 2019-04-09 13:38 UTC

Description Jakub Okoński 2019-02-06 12:09:29 UTC
I've had crashes that I was unable to diagnose in the same application (a custom one that I wrote) on Mesa 18, but I never had debug symbols or anything helpful.

I installed 19.0.0-rc2 and tried it again out of curiosity. It still crashes, but I get an assertion error at least:

v4: ../mesa-19.0.0-rc2/src/amd/vulkan/radv_cmd_buffer.c:2765: radv_CmdBindDescriptorSets: Assertion `dyn_idx < dynamicOffsetCount' failed.

Thread 8 "v4" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6e58700 (LWP 9437)]
0x00007ffff7c23d7f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7c23d7f in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7c0e672 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff7c0e548 in __assert_fail_base.cold.0 () from /usr/lib/libc.so.6
#3  0x00007ffff7c1c396 in __assert_fail () from /usr/lib/libc.so.6
#4  0x00007ffff49af3bf in ?? () from /usr/lib/libvulkan_radeon.so
#5  0x00007ffff4173548 in ?? () from /usr/lib/libVkLayer_unique_objects.so
#6  0x00007ffff40d936d in ?? () from /usr/lib/libVkLayer_unique_objects.so
#7  0x00007fffd54770e9 in ?? () from /usr/lib/libVkLayer_core_validation.so
#8  0x00007fffd50c6374 in ?? () from /usr/lib/libVkLayer_object_lifetimes.so
#9  0x00007fffd50351b0 in ?? () from /usr/lib/libVkLayer_object_lifetimes.so
#10 0x00007fffd4c34001 in ?? () from /usr/lib/libVkLayer_parameter_validation.so
#11 0x00007fffd494f684 in ?? () from /usr/lib/libVkLayer_thread_safety.so
#12 0x00007fffd48e5a43 in ?? () from /usr/lib/libVkLayer_thread_safety.so
#13 0x000055555588990c in ash::vk::DeviceFnV1_0::cmd_bind_descriptor_sets ()

This assertion only triggers with standard validation layers turned on. When I re-run my app without them enabled, the backtrace is similar but no assertion is triggered. This may indicate some weird interaction with validation layers perhaps?

Thread 2 "v4" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7a5e700 (LWP 12868)]
0x00007ffff4bb01cd in ?? () from /usr/lib/libvulkan_radeon.so
(gdb) bt
#0  0x00007ffff4bb01cd in ?? () from /usr/lib/libvulkan_radeon.so
#1  0x000055555588711c in ash::vk::DeviceFnV1_0::cmd_bind_descriptor_sets ()

Because this is a segfault and not an explicit assertion that failed, it may be that these are separate problems.

Either way, I am not using any dynamic offsets, and this code works on the proprietary AMD driver on Windows. Strangely, it does not work on amdvlk; in fact, it puts my GPU in an infinite loop and I have to reboot the whole machine. It also works on a Skylake iGPU using anv, but it hangs during shader execution for a different reason.
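
For reference, the failed assertion guards the walk over pDynamicOffsets. Below is a minimal sketch of that kind of check, not radv's actual code. `consume_dynamic_offsets` and the per-set counts are hypothetical names; the only assumption is the Vulkan rule that every *_DYNAMIC descriptor in the bound sets consumes one entry of pDynamicOffsets.

```rust
/// Hypothetical model of a driver walking the dynamic offsets of one bind call.
/// `dynamic_counts[i]` is the number of *_DYNAMIC descriptors in the layout of set i.
/// Returns how many offsets were consumed, or an error where an assert would fire.
fn consume_dynamic_offsets(dynamic_counts: &[usize], offsets: &[u32]) -> Result<usize, String> {
    let mut dyn_idx = 0;
    for (set, &count) in dynamic_counts.iter().enumerate() {
        for _ in 0..count {
            // Same shape as `assert(dyn_idx < dynamicOffsetCount)`.
            if dyn_idx >= offsets.len() {
                return Err(format!(
                    "set {}: dyn_idx {} >= dynamicOffsetCount {}",
                    set,
                    dyn_idx,
                    offsets.len()
                ));
            }
            dyn_idx += 1;
        }
    }
    Ok(dyn_idx)
}
```

Under this model, `consume_dynamic_offsets(&[0, 0], &[])` never reaches the check, while `consume_dynamic_offsets(&[1, 0], &[])` fails at once. So the check can only fire if the layout seen by the driver reports at least one dynamic descriptor.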

Please advise on what data to extract for debugging purposes.
Comment 1 Jakub Okoński 2019-02-06 12:12:32 UTC
This is on 19.0.0-rc2 compiled with --buildtype=debug, but debug symbols are still missing? I'm not sure why that is the case.
Comment 2 Samuel Pitoiset 2019-02-06 13:08:44 UTC
First, are you sure you are using dynamic bindings correctly?
Can you share a link to your custom app?
Comment 3 Jakub Okoński 2019-02-06 13:26:13 UTC
I'm not sure what you mean, I don't use any of the *_DYNAMIC variants of DescriptorType.

This is how I define the layout of 2nd Descriptor Set I'm trying to bind in the failing call:
https://github.com/farnoy/renderer/blob/3d42799a1fa6f6881c86506d9f76670899f0f7a9/src/forward_renderer/renderer.rs#L113-L142

And this is how I bind it:
https://github.com/farnoy/renderer/blob/3d42799a1fa6f6881c86506d9f76670899f0f7a9/src/forward_renderer/renderer.rs#L703-L710
Comment 4 Jakub Okoński 2019-02-06 13:48:20 UTC
If I try binding just the 2nd one, it does not crash and lets me submit that command buffer to the compute queue. The problem must be with the first set (called mvp_set in my app). It fails to bind in a graphics context as well. I will keep fiddling to see if I can change the definition until it binds successfully.

Its (failing) layout is:

    vk::DescriptorSetLayoutBinding {
            binding: 0,
            descriptor_type: vk::DescriptorType::UNIFORM_BUFFER,
            descriptor_count: 1,
            stage_flags: vk::ShaderStageFlags::VERTEX | vk::ShaderStageFlags::COMPUTE,
            p_immutable_samplers: ptr::null(),
    }
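
As a sanity check on the application side, here is a small sketch that derives the dynamicOffsetCount a layout calls for. The types are simplified stand-ins written for this illustration, not the ash types, and `expected_dynamic_offsets` and `mvp_layout` are hypothetical helpers.

```rust
/// Subset of the Vulkan descriptor types, enough for this illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DescriptorType {
    UniformBuffer,
    StorageBuffer,
    UniformBufferDynamic,
    StorageBufferDynamic,
}

/// Simplified stand-in for `vk::DescriptorSetLayoutBinding`.
struct Binding {
    descriptor_type: DescriptorType,
    descriptor_count: u32,
}

/// Number of entries a set with these bindings contributes to dynamicOffsetCount:
/// one per descriptor in every *_DYNAMIC binding.
fn expected_dynamic_offsets(bindings: &[Binding]) -> u32 {
    bindings
        .iter()
        .filter(|b| {
            matches!(
                b.descriptor_type,
                DescriptorType::UniformBufferDynamic | DescriptorType::StorageBufferDynamic
            )
        })
        .map(|b| b.descriptor_count)
        .sum()
}

/// The single-binding layout shown above (stage flags omitted).
fn mvp_layout() -> Vec<Binding> {
    vec![Binding {
        descriptor_type: DescriptorType::UniformBuffer,
        descriptor_count: 1,
    }]
}
```

For the layout above, `expected_dynamic_offsets(&mvp_layout())` is 0, which matches passing no dynamic offsets at bind time.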
Comment 5 Jakub Okoński 2019-02-06 14:04:59 UTC
So I've duplicated the binding of that set into two separate bindings, one for VERTEX stage, one for COMPUTE. And I updated them the same way:

Thread 0, Frame 0:
vkUpdateDescriptorSets(device, descriptorWriteCount, pDescriptorWrites, descriptorCopyCount, pDescriptorCopies) returns void:
    device:                         VkDevice = 0x5577e9626f10
    descriptorWriteCount:           uint32_t = 2
    pDescriptorWrites:              const VkWriteDescriptorSet* = 0x7ffdeda2aee0
        pDescriptorWrites[0]:           const VkWriteDescriptorSet = 0x7ffdeda2aee0:
            sType:                          VkStructureType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET (35)
            pNext:                          const void* = NULL
            dstSet:                         VkDescriptorSet = 0x5577e9626c00
            dstBinding:                     uint32_t = 0
            dstArrayElement:                uint32_t = 0
            descriptorCount:                uint32_t = 1
            descriptorType:                 VkDescriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER (6)
            pImageInfo:                     const VkDescriptorImageInfo* = UNUSED
            pBufferInfo:                    const VkDescriptorBufferInfo* = 0x7ffdeda2ad80
                pBufferInfo[0]:                 const VkDescriptorBufferInfo = 0x7ffdeda2ad80:
                    buffer:                         VkBuffer = 0x5577e9626b10
                    offset:                         VkDeviceSize = 0
                    range:                          VkDeviceSize = 262144
            pTexelBufferView:               const VkBufferView* = UNUSED
        pDescriptorWrites[1]:           const VkWriteDescriptorSet = 0x7ffdeda2af20:
            sType:                          VkStructureType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET (35)
            pNext:                          const void* = NULL
            dstSet:                         VkDescriptorSet = 0x5577e9626c00
            dstBinding:                     uint32_t = 1
            dstArrayElement:                uint32_t = 0
            descriptorCount:                uint32_t = 1
            descriptorType:                 VkDescriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER (6)
            pImageInfo:                     const VkDescriptorImageInfo* = UNUSED
            pBufferInfo:                    const VkDescriptorBufferInfo* = 0x7ffdeda2ad80
                pBufferInfo[0]:                 const VkDescriptorBufferInfo = 0x7ffdeda2ad80:
                    buffer:                         VkBuffer = 0x5577e9626b10
                    offset:                         VkDeviceSize = 0
                    range:                          VkDeviceSize = 262144
            pTexelBufferView:               const VkBufferView* = UNUSED
    descriptorCopyCount:            uint32_t = 0
    pDescriptorCopies:              const VkCopyDescriptorSet* = 0x5577e82ee320

Now I get a different assertion error, but only when using the api_dump layer (without other validation errors):

v4: ../mesa-19.0.0-rc2/src/amd/vulkan/radv_cmd_buffer.c:2727: radv_bind_descriptor_set: Assertion `!(set->layout->flags & VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR)' failed.

This really confuses me because I am not using or enabling the push descriptor extension. How could this bit be set?
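
To make the assertion concrete, here is a minimal sketch of the bit test it performs. The constant is the value of VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR from the Vulkan headers; `is_push_descriptor_layout` is a hypothetical helper, not a radv or ash function.

```rust
/// Value of VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR in the Vulkan headers.
const PUSH_DESCRIPTOR_BIT_KHR: u32 = 0x0000_0001;

/// True if the layout creation flags mark the layout as a push descriptor layout.
/// The assertion above requires this to be false for a set bound the regular way.
fn is_push_descriptor_layout(flags: u32) -> bool {
    (flags & PUSH_DESCRIPTOR_BIT_KHR) != 0
}
```

A layout created with flags of 0 gives `is_push_descriptor_layout(0) == false`. The sketch only shows what is being tested; it does not explain why the bit appeared set in this run.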
Comment 6 Jakub Okoński 2019-02-10 21:15:22 UTC
I've made some unrelated changes (the mvp_set and its layout are still the same) and even started using advanced descriptor indexing features with partially bound descriptor sets, and suddenly it's working. Not with validation layers though; that still segfaults when recording commands.
Comment 7 Samuel Pitoiset 2019-02-11 14:50:14 UTC
If you explain to me how to build your app, I should be able to reproduce the problem myself and investigate.
Comment 8 Jakub Okoński 2019-02-11 17:21:14 UTC
Alright, so here goes:

# glTF-Sample-Models submodule is pretty big
$ git clone --recurse-submodules https://github.com/farnoy/renderer.git

Install rustup from https://rustup.rs; Linux distros usually package their own. Then you need a nightly version of the Rust compiler:

$ rustup install nightly

Then from the cloned repo:

# to get the version that uses descriptor indexing and partially bound sets:
$ git checkout a8be0be1c44cac83e1f24fc066a97ee7b82be516
$ rustup run nightly cargo run --release
# this should work and render some helmets and a ground mesh
# WSAD and the mouse get you around
$ rustup run nightly cargo run --release --features validation
# this enables lunarg validation layers which for me produce a SEGFAULT

# to get the old version of my app from the initial report:

$ git checkout 3d42799a1fa6f6881c86506d9f76670899f0f7a9
$ rustup run nightly cargo run --release --features validation
# On Vulkan ICD loader 1.1.96, these caused the weird dynamicOffset assertions in radv;
# on loader 1.1.99, these hang my GPU and I need to reset my machine

Let me know if you have any questions. I also upgraded the kernel from 4.20.6 to 4.20.7 since the original report, but I think it's more likely that the Vulkan loader upgrade changed the behavior of my program at commit 3d42799a1fa6f6881c86506d9f76670899f0f7a9.
Comment 9 Samuel Pitoiset 2019-04-08 14:46:23 UTC
Do you still have a problem with binding offsets?
Comment 10 Jakub Okoński 2019-04-09 12:34:40 UTC
Nope, I made some unrelated changes and I'm not hitting this problem anymore.
Comment 11 Samuel Pitoiset 2019-04-09 12:39:26 UTC
I assume it was a bug on your side?
Comment 12 Jakub Okoński 2019-04-09 12:43:16 UTC
I don't think so. Like I said, I wasn't using dynamic offsets at all. Maybe the validation layers added something, but I'm not sure anymore.

