Bug 108824

Summary: Invalid handling when GL buffer is bound on one context and invalidated on another
Product: Mesa Reporter: Baldur Karlsson <baldurk>
Component: Drivers/Gallium/radeonsi Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: lonewolf, olivier.jolly
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=110701
Attachments: piglit test showing broken behaviour
backtrace of crash when hitting this assert (from 18.3.3/19.0.0-rc1)
wip patch
likely fix

Description Baldur Karlsson 2018-11-21 20:35:15 UTC
Created attachment 142556 [details]
piglit test showing broken behaviour

I found some odd behaviour that I think I've tracked down to some incorrect handling of buffer invalidation in radeonsi.

The rough order of events is:

1. Create a buffer that's shared between two contexts. Ensure it's bound as a UBO on both.
2. Invalidate the buffer with e.g. glMapBufferRange(GL_MAP_INVALIDATE_BUFFER_BIT) on context A.
3. Context B's buffer bind is now in a bad state. Rendering will have unpredictable results, and invalidating the buffer again on context B may fail.

That's a bit vague but that's the general repro that I know for sure. This will then result in unpredictable reads/garbage data, and quite likely you'll eventually hit the assert on src/gallium/drivers/radeonsi/si_descriptors.c:1489 - assert(old_buf_va <= old_desc_va);

My understanding is that the radeonsi code looks through all bound buffers whenever an invalidate happens and fixes up the descriptors: it subtracts the old buffer's outgoing VA from the descriptor's VA to get the offset, then adds that offset onto the incoming VA and updates the descriptor.

The problem seems to be that when this happens for a buffer invalidate it only checks the current context's bound buffers - so other contexts don't have their descriptors updated. That means the old VA is still being pointed at, and if an invalidate happens again on the second thread the descriptor is referring to an even older VA than the outgoing VA so there's no longer any sense in the subtract call.
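To make the arithmetic concrete, here is a minimal sketch of the failure mode described above. It is a standalone model, not radeonsi code: `struct desc`, `rebase_desc`, `replay_cross_context_invalidate`, and the VA values are all hypothetical names and numbers chosen for illustration, and the real driver asserts where this sketch returns 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one context's buffer descriptor. */
struct desc {
    uint64_t va; /* GPU virtual address the descriptor points at */
};

/*
 * Rebase a descriptor from the outgoing allocation to the incoming one.
 * Returns 1 on success, or 0 if the descriptor lies below the outgoing
 * buffer, i.e. the condition old_buf_va <= old_desc_va does not hold.
 */
static int rebase_desc(struct desc *d, uint64_t old_buf_va, uint64_t new_buf_va)
{
    if (old_buf_va > d->va)
        return 0; /* offset would underflow; descriptor left untouched */

    uint64_t offset = d->va - old_buf_va;
    d->va = new_buf_va + offset;
    return 1;
}

/*
 * Replays the reported sequence with made-up addresses:
 *   1. both contexts bind the same buffer at the same offset,
 *   2. context A invalidates, and only A's descriptor is fixed up,
 *   3. context B invalidates while still pointing at the first allocation.
 * Returns the result of B's rebase and reports both final descriptor VAs.
 */
int replay_cross_context_invalidate(uint64_t *ctx_a_va, uint64_t *ctx_b_va)
{
    const uint64_t va0 = 0x1000, va1 = 0x2000, va2 = 0x3000;
    const uint64_t bind_offset = 0x40;

    struct desc ctx_a = { va0 + bind_offset };
    struct desc ctx_b = { va0 + bind_offset };

    /* Invalidate on A: va0 -> va1. B's descriptor is not visited. */
    (void)rebase_desc(&ctx_a, va0, va1);

    /* Invalidate on B: the outgoing allocation is va1, but B is stale. */
    int ok_b = rebase_desc(&ctx_b, va1, va2);

    *ctx_a_va = ctx_a.va;
    *ctx_b_va = ctx_b.va;
    return ok_b;
}
```

In this model, context B's descriptor still points into the first allocation when the second invalidate arrives, so the subtraction has no meaningful result. That is the situation the assert in si_descriptors.c appears to catch.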

I've attached a piglit test which hopefully should drop right in, it runs through the steps above and does a pixel readback to ensure the rendering went correctly. If you remove the readback you can see flickering output. It runs fine with both the readback and the rendering if I switch to swrast.

I'm on an RX 480 and tested the bug with both git-61b535437e and 18.2.4 from padoka's PPA.
Comment 1 olivier.jolly 2019-02-01 15:26:54 UTC
Created attachment 143269 [details]
backtrace of crash when hitting this assert (from 18.3.3/19.0.0-rc1)
Comment 2 olivier.jolly 2019-02-01 15:27:35 UTC
I also encounter what is most probably this same bug (same assertion at least) at random when using Blender 2.80.

My setup is debian unstable with a Radeon HD 7950 (and also GeForce GTX 1060 for Cuda only).

I encountered this crash on mesa 18.3.2 (packaged in debian), 18.3.3 and 19.0.0-rc1 (compiled manually)
Comment 3 dnicolas 2019-03-22 16:56:26 UTC
I'm also finding the same problem with Blender 2.80. Sometimes it crashes **very** often, making it almost unusable.

Is there anyone who can take a look at this?

AMDGPU (Vega 56)
Kernel 4.20.15
Mesa 18.3.4
Fedora 29
Comment 4 Marek Olšák 2019-05-10 05:22:18 UTC
This is fixed by these patches:
Comment 5 Marek Olšák 2019-05-13 21:47:37 UTC
Baldur, can I set the license of your piglit test to MIT? Thanks.

Comment 6 Baldur Karlsson 2019-05-13 21:53:44 UTC
Yes, that's fine with me. I'll try to test the patches on my program soon.
Comment 7 Baldur Karlsson 2019-05-16 16:34:43 UTC
I applied the patchset on top of latest mesa (aa040d3b3c7d068e1ece61c71770c16a54745f89) and I seem to get some rendering corruption that I don't get with the parent commit before applying the patches.

It seems to only appear in RenderDoc, or at least it doesn't happen when running tiny demo programs. I can't isolate a simpler test case just now but it seems reliably reproducible and only shows up when I build with the patches applied.

To repro with RenderDoc:

* Download or build RenderDoc 1.4
* Build gears3d from https://github.com/gears3d/gears3d
* Launch gears3d through RenderDoc, capture, open the frame
* Step back and forth through the drawcalls and the texture viewer will show up with some corruption.

Screenshot here: https://i.imgur.com/1Dk7diS.png
Comment 8 LoneVVolf 2019-05-19 23:29:41 UTC
Baldur, I encounter similar visual corruption when running knetwalk.

See comment #12 in https://bugs.freedesktop.org/show_bug.cgi?id=110701#c12

Maybe these two bugs are related?
Comment 9 LoneVVolf 2019-05-20 13:47:32 UTC
Reverting commit https://cgit.freedesktop.org/mesa/mesa/commit/?id=78e35df52aa2f7d770f929a0866a0faa89c261a9 solves the visual corruption and gets rid of the GPU fault messages in dmesg.

As that commit is 2/2 of the patchset referenced in comment #4, it does look like the patchset introduces new errors.
See https://bugs.freedesktop.org/show_bug.cgi?id=110701
Comment 10 Pierre-Eric Pelloux-Prayer 2019-05-21 08:40:17 UTC
(In reply to Baldur Karlsson from comment #7)
> To repro with RenderDoc:
> * Download or build RenderDoc 1.4
> * Build gears3d from https://github.com/gears3d/gears3d
> * Launch gears3d through RenderDoc, capture, open the frame
> * Step back and forth through the drawcalls and the texture viewer will show
> up with some corruption.
> Screenshot here: https://i.imgur.com/1Dk7diS.png

I tried to reproduce the issue and actually had 2 different issues:
- before 12bf7cfecf52083c484602f971738475edfe497e: the rendering is corrupted as described above. Reverting 78e35df52aa2f7d770f929a0866a0faa89c261a9 fixes the rendering.

- starting from 12bf7cfecf52083c484602f971738475edfe497e: the rendering is corrupted and wrong: I only see the red gear, the green/blue ones are never drawn
Comment 11 Pierre-Eric Pelloux-Prayer 2019-05-21 16:28:18 UTC
Created attachment 144311 [details] [review]
wip patch

The attached patch (applied on top of the problematic commit 78e35df52a) seems to fix the corruption problem, but I don't know the code well enough to decide whether it's a correct fix.
Comment 12 Marek Olšák 2019-05-21 18:35:30 UTC
Created attachment 144312 [details] [review]
likely fix

This patch should fix it. Thanks to Pierre-Eric for inspiring it.
Comment 13 LoneVVolf 2019-05-21 20:53:29 UTC
Applying the "likely fix" patch in https://bugs.freedesktop.org/show_bug.cgi?id=108824#c12 solves the issue with plasma shell/knetwalk on my rx 580.
Comment 14 raffarti 2019-05-22 21:59:28 UTC
The patch fixes corruption caused by 78e35df52aa2f7d770f929a0866a0faa89c261a9 but not the one from 12bf7cfecf52083c484602f971738475edfe497e, which still persists in scroll bars of falkon and akregator.
I'm using an RX 480.
Comment 15 GitLab Migration User 2019-09-25 18:29:31 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1341.
