Summary: | GPU faults in Unigine Valley 1.0 | ||
---|---|---|---|
Product: | Mesa | Reporter: | Eero Tamminen <eero.t.tamminen> |
Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | Default DRI bug account <dri-devel> |
Severity: | normal | ||
Priority: | medium | CC: | cwidmer, gw.fossdev, kyle.devir, lonewolf, sarnex |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=108824 | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: | likely fix |
Description
Eero Tamminen
2019-05-17 11:58:27 UTC
Having a similar issue, where using OpenGL in certain applications causes dmesg to fill with GPU faults, I decided to bisect Mesa and was able to identify commit [1] as the culprit. I specifically encountered the problem in Anki with hardware acceleration, and in mpv with the x11egl context and VAAPI hardware decoding. Reverting that commit makes the issue disappear for me. Does this, by any chance, solve the issue for you as well?

[1] [78e35df52aa2f7d770f929a0866a0faa89c261a9] radeonsi: update buffer descriptors in all contexts after buffer invalidation

I should probably mention that I am using a Radeon RX580 on Gentoo with kernel 5.1.3 and a pretty recent git version of LLVM.

Similar here: after today's update I get the same errors, even with KDE KWin. Looks like Marek changed something. Error:

[ 240.649210] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649211] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100be6000 from 27
[ 240.649212] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 240.649215] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649216] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100bf3000 from 27
[ 240.649217] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 240.649220] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
and so on.
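A dump like the one above can be tallied with a short script to see which processes trigger the gfxhub faults and how many distinct pages they hit. This is a minimal Python sketch. The regular expressions are modelled only on the lines quoted in this comment (amdgpu, kernel 5.1, Vega). That is an assumption, because the kernel's fault message layout differs between versions and GPU generations. The function name `summarize_gfxhub_faults` is hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the gfxhub lines quoted above; this layout is an
# assumption and is not a stable kernel interface.
FAULT_RE = re.compile(
    r"\[gfxhub\] no-retry page fault .*?for process (?P<proc>\S+) pid \d+"
)
PAGE_RE = re.compile(r"in page starting at address 0x(?P<addr>[0-9a-fA-F]+)")


def summarize_gfxhub_faults(dmesg_text):
    """Count faults per process and collect the distinct faulting page addresses."""
    per_process = Counter()
    pages = set()
    for line in dmesg_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            per_process[fault.group("proc")] += 1
            continue
        page = PAGE_RE.search(line)
        if page:
            pages.add(int(page.group("addr"), 16))
    return per_process, sorted(pages)


SAMPLE = """\
[ 240.649210] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649211] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100be6000 from 27
[ 240.649212] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 240.649215] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649216] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100bf3000 from 27
"""

if __name__ == "__main__":
    counts, pages = summarize_gfxhub_faults(SAMPLE)
    print(dict(counts))
    print([hex(p) for p in pages])
```

On the sample it reports two faults for systemsettings5, on two pages 0xd000 bytes apart. Feeding it the full `dmesg` output instead of the sample would show whether one process dominates.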
My specs:
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.30.0, 5.1.0-gentoo, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel (git-28c2ce7105)

I did a bisect and found the bad commit: https://cgit.freedesktop.org/mesa/mesa/commit/?id=4549c3678865236216952f649fa5ed0115fe81b9

Can you try to build Mesa at the previous commit? For example 6b3343e5d80abf162b45f0d7e977449588824706

I think we need to change the title of this bug.

> Can you try to build Mesa at the previous commit? For example
> 6b3343e5d80abf162b45f0d7e977449588824706
>
> I think we need to change the title of this bug.

Sorry, it's also unstable, but I can't reproduce the error easily.
Anyway, try the commit before Marek's patches: d65b160e6a8712a33d72bea1a1b49587d483a18a
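Building the commit just before a suspect series gives a bisection its known-good endpoint. Here the suspect series is Marek's patches. The Python sketch below roughly illustrates what `git bisect` does between a good and a bad endpoint: a binary search over a linear history. The commit labels and the `is_bad` predicate are hypothetical stand-ins for building Mesa and running a test case. Real `git bisect` also handles merge commits, which this sketch ignores.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history (oldest first).

    Preconditions (assumed, not checked): commits[0] is known good,
    commits[-1] is known bad, and once the bug appears it never goes away.
    A flaky test that lets a bad build pass violates the last assumption.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the first bad commit is at mid or earlier
        else:
            good = mid  # the first bad commit is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical history, oldest first; the labels are placeholders, not Mesa commits.
    history = ["base", "c1", "c2", "c3", "c4", "c5", "tip"]
    introduced = history.index("c3")
    tested = []

    def is_bad(commit):
        tested.append(commit)
        return history.index(commit) >= introduced

    print(first_bad(history, is_bad), "after testing", tested)
```

In this seven-commit example at most three test builds are needed. The weak point, for a thread like this one, is the last precondition: if a bad build can pass a test run, the search can converge on the wrong commit without any visible error.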
OK, I now know the bug is somewhere in these 3 commits:

f3ae455eb08e8d718b828eb42f2529437916179b radeonsi: compute culling - flush CS to remove write references to buffers
0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 radeonsi: remove old_va parameter from si_rebind_buffer by remembering offsets
78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors in all contexts after buffer invalidation

I will test more. It looks like some commit after the current one makes this bug more reproducible. It existed before too, but not as often.

(In reply to Yury Zhuravlev from comment #6)
> 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> in all contexts after buffer invalidation

That is the commit I identified in comment #1 as being responsible for my issues. I would not be surprised if reverting that one makes your faults disappear as well.

(In reply to Christian Widmer from comment #7)
> (In reply to Yury Zhuravlev from comment #6)
> > 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> > in all contexts after buffer invalidation
>
> That is the commit I identified in comment #1 as being responsible for my
> issues. I would not be surprised if reverting that one makes your faults
> disappear as well.

Unfortunately not. I have this issue even without that commit, just less severe.

(In reply to Yury Zhuravlev from comment #5)
> > Can you try to build Mesa at the previous commit? For example
> > 6b3343e5d80abf162b45f0d7e977449588824706
> >
> > I think we need to change the title of this bug.
>
> Sorry, it's also unstable, but I can't reproduce the error easily.

Opening a Firefox private window causes this error every time (built with --enable-webrender and --enable-rust-simd; not sure if that makes a difference).

(In reply to Mariusz Ceier from comment #9)
> (In reply to Yury Zhuravlev from comment #5)
> > > Can you try to build Mesa at the previous commit? For example
> > > 6b3343e5d80abf162b45f0d7e977449588824706
> > >
> > > I think we need to change the title of this bug.
> >
> > Sorry, it's also unstable, but I can't reproduce the error easily.
>
> Opening a Firefox private window causes this error every time (built with
> --enable-webrender and --enable-rust-simd; not sure if that makes a difference).

Currently I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable for me. Can you check this one? (I don't have a modern Firefox yet.)

(In reply to Yury Zhuravlev from comment #10)
> (In reply to Mariusz Ceier from comment #9)
> > (In reply to Yury Zhuravlev from comment #5)
> > > > Can you try to build Mesa at the previous commit? For example
> > > > 6b3343e5d80abf162b45f0d7e977449588824706
> > > >
> > > > I think we need to change the title of this bug.
> > >
> > > Sorry, it's also unstable, but I can't reproduce the error easily.
> >
> > Opening a Firefox private window causes this error every time (built with
> > --enable-webrender and --enable-rust-simd; not sure if that makes a difference).
>
> Currently I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable
> for me. Can you check this one?
> (I don't have a modern Firefox yet.)

Just tried it, and the error doesn't happen.

I get a similar bug when running knetwalk [1]. As soon as the application gets focus, there's visual corruption in its window. If I move the mouse away, the corruption (and the messages) go away. I'm running mesa-git built an hour ago on an RX 580. I will try to verify tomorrow which of the commits mentioned matter.
[1] https://kde.org/applications/games/knetwalk/

dmesg snippet:

[ 1642.706004] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0e08040c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706010] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BC1
[ 1642.706012] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E00400C
[ 1642.706016] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051585, read from 'TC1' (0x54433100) (4)
[ 1642.706074] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38440c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706078] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100B87
[ 1642.706080] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706082] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051527, read from 'TC5' (0x54433500) (68)
[ 1642.706087] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38480c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706089] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100B9D
[ 1642.706090] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706093] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051549, read from 'TC5' (0x54433500) (68)
[ 1642.706098] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c80c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706102] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BE2
[ 1642.706104] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706106] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051618, read from 'TC5' (0x54433500) (68)
[ 1642.706111] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c40c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706113] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BD0
[ 1642.706115] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E0C800C

https://bugs.freedesktop.org/show_bug.cgi?id=108824 appears to be related.

78e35df52aa2f7d770f929a0866a0faa89c261a9 is confirmed as the first bad commit for me. It was causing text corruption in Plasma Shell, and random visual corruption in Blender-git, KDE System Settings and VSCode. The immediately preceding commit showed no issues at all.

Ah, also: recent LLVM master with an RX580.

(In reply to kyle.devir from comment #15)
> Ah, also: recent LLVM master with an RX580.

Probably something broke earlier for Vega, but I agree that everything became much worse after that commit. How long were you testing the commit before 78e35df52aa2f7d770f929a0866a0faa89c261a9?

A few seconds. The corruption was almost immediately visible to me, and I could trigger the glitches with little effort. The prior commit is fine, though: I've been using 0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 without a single visual glitch so far. Vega 10 seems affected in a different way, for some reason.

(In reply to kyle.devir from comment #17)
> Vega 10 seems affected in a different way, for some reason.

It does indeed seem that way. I have an RX580 like you, and for me it is enough to revert only commit 78e35df52aa2f7d770f929a0866a0faa89c261a9. Later commits do not seem to cause any problems either. I have been using Mesa (at git-28c2ce7105) with only that single commit patched out, without issues, for two days.

*** Bug 110717 has been marked as a duplicate of this bug. ***

There were several days when I didn't see this problem, but now I have triggered it once again. It seems to happen very rarely: so far only twice in 30 runs of Valley (done on different days and with different graphics stack git versions). It would therefore be better to use a fully reproducible case as the main bug (e.g. the one from comment 9) instead of this one.

Created attachment 144313 (likely fix)

This patch should fix it. Thanks to Pierre-Eric for inspiring it.
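The knetwalk dmesg snippet from the RX 580 report uses the older VM_CONTEXT1 fault format, and two of its fields can be cross-checked by hand. First, the page number in each "VM fault … at page N" line is the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR value written in decimal (0x00100BC1 = 1051585). Second, the hex value after the client name is that name in ASCII: 0x54433100 is 'T', 'C', '1' followed by a NUL byte. The Python sketch below checks both. The regular expressions are modelled only on this snippet. Converting the page number to a byte address assumes 4 KiB GPU pages; the log does not state that. The function names are hypothetical.

```python
import re

# Patterns modelled on the knetwalk snippet (older VM_CONTEXT1 fault format);
# the exact layout is an assumption taken from that snippet only.
FAULT_ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x(?P<addr>[0-9a-fA-F]+)")
VM_FAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-fA-F]+, vmid \d+, pasid \d+\) at page (?P<page>\d+), "
    r"(?:read|write) from '(?P<client>[^']*)' \(0x(?P<tag>[0-9a-fA-F]+)\)"
)
GPU_PAGE_SIZE = 4096  # assumption: 4 KiB GPU pages, used only for byte addresses


def decode_client_tag(value):
    """Decode a client ID such as 0x54433100 into its ASCII name ('TC1')."""
    return value.to_bytes(4, "big").rstrip(b"\0").decode("ascii")


def decode_vm_faults(dmesg_text):
    """Return (page, byte_address, client) per fault, cross-checking the log."""
    results = []
    last_addr = None
    for line in dmesg_text.splitlines():
        addr = FAULT_ADDR_RE.search(line)
        if addr:
            last_addr = int(addr.group("addr"), 16)
            continue
        fault = VM_FAULT_RE.search(line)
        if not fault:
            continue
        page = int(fault.group("page"))
        client = fault.group("client")
        # The decimal page number should equal the preceding FAULT_ADDR value.
        if last_addr is not None and page != last_addr:
            raise ValueError(f"page {page} != FAULT_ADDR {last_addr:#x}")
        # The hex value after the client name should spell that name in ASCII.
        if decode_client_tag(int(fault.group("tag"), 16)) != client:
            raise ValueError(f"client tag mismatch for {client!r}")
        results.append((page, page * GPU_PAGE_SIZE, client))
    return results


SAMPLE = """\
[ 1642.706010] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BC1
[ 1642.706012] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E00400C
[ 1642.706016] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051585, read from 'TC1' (0x54433100) (4)
[ 1642.706078] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100B87
[ 1642.706080] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706082] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051527, read from 'TC5' (0x54433500) (68)
"""

if __name__ == "__main__":
    for page, byte_addr, client in decode_vm_faults(SAMPLE):
        print(page, hex(byte_addr), client)
```

The two checks simply confirm that the log is internally consistent. They say nothing about why the faulting addresses are unmapped.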
(In reply to Marek Olšák from comment #21)
> Created attachment 144313 (likely fix)
>
> This patch should fix it. Thanks to Pierre-Eric for inspiring it.

I can confirm that this patch does seem to fix the issue for me. At least, my test cases cannot reproduce it as easily with this patch as they could without it on my RX580. Hopefully it will fix the problems for the Vega owners as well.

Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.

(In reply to Marek Olšák from comment #23)
> Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.

The original issue was about Vega, and on Vega we saw a different problem. I suppose somebody should check the patch on Vega before closing the issue. I will do it soon.

(In reply to Yury Zhuravlev from comment #24)
> (In reply to Marek Olšák from comment #23)
> > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
>
> The original issue was about Vega, and on Vega we saw a different problem. I
> suppose somebody should check the patch on Vega before closing the issue.
> I will do it soon.

Since nobody responded: on a Vega 64, restoring a Firefox (nightly) session immediately produced GPU faults like the ones posted here, followed by a GPU hang. With Mesa master this no longer happens.

(In reply to Christoph Haag from comment #25)
> (In reply to Yury Zhuravlev from comment #24)
> > (In reply to Marek Olšák from comment #23)
> > > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
> >
> > The original issue was about Vega, and on Vega we saw a different problem. I
> > suppose somebody should check the patch on Vega before closing the issue.
> > I will do it soon.
>
> Since nobody responded: on a Vega 64, restoring a Firefox (nightly) session
> immediately produced GPU faults like the ones posted here, followed by a GPU
> hang. With Mesa master this no longer happens.

I agree, everything is fine now. (Vega56)
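Comment #20 reports two failures in 30 Valley runs. That rate limits how much a clean run proves, whether a build is being judged during bisection or a fix is being confirmed. Assume each run fails independently with the same probability p. This is a simplifying assumption, since GPU faults may well depend on timing and system state. Under it, the chance that n clean runs miss a bad build is (1 - p)^n. The Python sketch below evaluates that with p = 2/30, taken from comment #20. The function names are hypothetical.

```python
import math


def miss_probability(p, runs):
    """Chance that `runs` independent runs all pass on a build that fails with probability p."""
    return (1.0 - p) ** runs


def runs_needed(p, confidence):
    """Smallest n such that miss_probability(p, n) <= 1 - confidence (0 < p < 1)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 2 / 30  # observed failure rate from comment #20 (Valley), assumed constant
    print(f"miss chance after 10 clean runs: {miss_probability(p, 10):.2f}")
    print(f"runs needed for 95% confidence: {runs_needed(p, 0.95)}")
```

With p = 2/30, ten clean runs still miss a bad build about half the time, and 44 runs are needed to push the miss chance below 5%. A case that fails on every run, like the Firefox private window from comment #9, needs only one run per build. That is the point of comment #20's suggestion to use a fully reproducible case.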