Setup:
- FullHD monitor (through HDMI KVM)
- HadesCanyon KBL i7-8809G ([AMD/ATI] Vega [Radeon RX Vega M] (rev c0))
- Ubuntu 18.04
- drm-tip git kernel v5.1
- Latest VegaM firmware from kernel git
- X server git version
- Unigine Valley 1.0
- Mesa:
  OpenGL renderer string: AMD VEGAM (DRM 3.32.0, 5.1.0, LLVM 7.0.0)
  OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel (git-d65b160e6a)

Test-cases:
* Several runs of:
  bin/valley_x64 -project_name Valley -data_path ../ -engine_config ../data/valley_1.0.cfg -system_script valley/unigine.cpp -video_app opengl -sound_app null -video_mode -1 -video_fullscreen 1 -video_multisample 0 -video_width 1920 -video_height 1080 -extern_define ,BENCHMARK,RELEASE,LANGUAGE_EN,QUALITY_HIGH

Expected outcome:
* No GPU issues (same as earlier, e.g. with yesterday's Mesa a9cef4f0e5)

Actual outcome:
* On the last of 3 runs:
-----------------------------------------------
[ 451.020091] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b618802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020093] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00044F6C
[ 451.020093] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06188002
[ 451.020095] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 282476, read from 'TC0' (0x54433000) (392)
[ 451.020101] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b610402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020102] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 451.020102] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0608400C
[ 451.020103] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32772) at page 0, read from 'TC5' (0x54433500) (132)
[ 451.020109] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b698402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020110] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000002
[ 451.020110] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0618800C
[ 451.020111] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32772) at page 2, read from 'TC0' (0x54433000) (392)
[ 451.020468] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ac90802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020469] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 451.020470] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06104002
[ 451.020471] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 0, read from 'TC3' (0x54433300) (260)
[ 451.020476] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00108402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020477] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00046DD2
[ 451.020477] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088002
[ 451.020478] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 290258, read from 'TC4' (0x54433400) (136)
[ 451.020484] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ad90402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020484] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000460A8
[ 451.020485] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06008002
[ 451.020486] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 286888, read from 'TC6' (0x54433600) (8)
[ 451.020491] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0c380402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020492] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00046A0F
[ 451.020493] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06104002
[ 451.020494] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 289295, read from 'TC3' (0x54433300) (260)
[ 451.020499] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ac98402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020500] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00045EAA
[ 451.020500] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06108002
[ 451.020501] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 286378, read from 'TC2' (0x54433200) (264)
[ 451.020507] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ed88802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020507] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000454BD
[ 451.020508] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06108002
[ 451.020509] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 283837, read from 'TC2' (0x54433200) (264)
[ 451.020514] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0be80402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[ 451.020515] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0004732D
[ 451.020516] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088002
[ 451.020516] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 291629, read from 'TC4' (0x54433400) (136)
[ 451.025291] amdgpu 0000:01:00.0: IH ring buffer overflow (0x0008D420, 0x00006CE0, 0x0000D430)
[ 451.521948] [drm] Fence fallback timer expired on ring sdma0
-----------------------------------------------

This is the only time I've seen this so far -> it's possible that it requires a specific Mesa & kernel version.
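For anyone comparing fault lines: the numbers printed in each "VM fault" line can be recomputed from the two register values logged just before it. The following is only a sketch. The bit positions are an assumption inferred from the values in this log (they reproduce the lines quoted here), not copied from the amdgpu register headers, and the function names are made up for illustration.

```python
# Sketch: recompute the fields of a "VM fault" line from the logged registers.
# ASSUMPTION: the bit layout below was inferred from the log in this report.


def decode_fault_status(status: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into its printed fields."""
    return {
        "protections": status & 0xFF,         # printed first, e.g. 0x02 or 0x0c
        "client_id": (status >> 12) & 0x1FF,  # printed last, e.g. (392)
        "write": bool((status >> 24) & 0x1),  # False corresponds to "read from"
        "vmid": (status >> 25) & 0xF,         # printed as "vmid 3"
    }


def client_name(tag: int) -> str:
    """Turn a client tag such as 0x54433000 into its ASCII name ('TC0')."""
    return tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


def fault_page(addr_register: str) -> int:
    """The page number in the log is the FAULT_ADDR register value in decimal."""
    return int(addr_register, 16)
```

With the first fault above, decode_fault_status(0x06188002) gives vmid 3, protections 0x02 and client id 392, client_name(0x54433000) gives 'TC0', and fault_page("0x00044F6C") gives 282476, which matches the printed line.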
I have a similar issue: using OpenGL in certain applications (I hit it in Anki with hardware acceleration, and in mpv with the x11egl context and vaapi hardware decoding) fills dmesg with GPU faults. I bisected Mesa and identified commit [1] as the culprit. Reverting that one makes the issue disappear for me. Does this, by any chance, solve the issue for you as well?

[1] [78e35df52aa2f7d770f929a0866a0faa89c261a9] radeonsi: update buffer descriptors in all contexts after buffer invalidation
I should probably mention that I am using a Radeon RX580 on Gentoo with kernel 5.1.3 and a pretty recent git version of LLVM.
Similar here: after today's update I get the same kind of errors, even with KDE KWin. It looks like Marek changed something.

Error:
[ 240.649210] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649211] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100be6000 from 27
[ 240.649212] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 240.649215] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[ 240.649216] amdgpu 0000:1f:00.0: in page starting at address 0x0000800100bf3000 from 27
[ 240.649217] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 240.649220] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
and so on.

My spec:
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.30.0, 5.1.0-gentoo, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel (git-28c2ce7105)
I bisected and found the bad commit:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=4549c3678865236216952f649fa5ed0115fe81b9

Can you try to build Mesa for the previous commit? Like 6b3343e5d80abf162b45f0d7e977449588824706

I think we need to change the title of this bug.
> can you try to build mesa for previous commit? Like
> 6b3343e5d80abf162b45f0d7e977449588824706
>
> I think we need to change the title of this bug.

Sorry, it's also unstable, but I can't reproduce the error easily. Anyway, try the commit before Marek's patches: d65b160e6a8712a33d72bea1a1b49587d483a18a
OK, currently I know the bug is somewhere in these 3 commits:

f3ae455eb08e8d718b828eb42f2529437916179b radeonsi: compute culling - flush CS to remove write references to buffers
0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 radeonsi: remove old_va parameter from si_rebind_buffer by remembering offsets
78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors in all contexts after buffer invalidation

I will test more. It looks like some commit after these makes the bug more reproducible. Before that it also exists, but not as often.
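Since both kinds of fault message show up in the logs in this report ("GPU fault detected" on the RX 580 / Vega M, "no-retry page fault" on Vega 10), narrowing down the commit range could be partly automated. Below is a rough sketch of a checker for use with `git bisect run`, which treats exit code 0 as good and 1 as bad. The function names and the `since` cut-off are hypothetical; building Mesa and running the reproducer are assumed to happen elsewhere.

```python
import re

# Kernel timestamp prefix, e.g. "[  451.020091] ".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")
# The two fault messages quoted in this report.
FAULT = re.compile(r"GPU fault detected|no-retry page fault")


def fault_count_since(log_text: str, since: float = 0.0) -> int:
    """Count fault lines whose kernel timestamp is at or after `since` seconds."""
    count = 0
    for line in log_text.splitlines():
        stamp = TIMESTAMP.match(line)
        if stamp is None or float(stamp.group(1)) < since:
            continue  # not a timestamped line, or older than the current test run
        if FAULT.search(line):
            count += 1
    return count


def bisect_exit_code(log_text: str, since: float = 0.0) -> int:
    """Exit code for `git bisect run`: 0 marks the commit good, 1 marks it bad."""
    return 1 if fault_count_since(log_text, since) > 0 else 0
```

A wrapper script would run the reproducer, pass the dmesg output and the uptime recorded before the run to bisect_exit_code(), and exit with the result. Because the faults are intermittent on some setups, a single clean run is weak evidence that a commit is good, so the reproducer would need to be repeated several times per step.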
(In reply to Yury Zhuravlev from comment #6)
> 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> in all contexts after buffer invalidation

That is the commit I identified in comment #1 as being responsible for my issues. I would not be surprised if reverting that one makes your faults disappear as well.
(In reply to Christian Widmer from comment #7)
> (In reply to Yury Zhuravlev from comment #6)
> > 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> > in all contexts after buffer invalidation
>
> That is the commit I identified in comment #1 as being responsible for my
> issues. I would not be surprised if reverting that one makes your faults
> disappear as well.

Unfortunately no. I have this issue even without that commit, just not as severely.
(In reply to Yury Zhuravlev from comment #5)
> > can you try to build mesa for previous commit? Like
> > 6b3343e5d80abf162b45f0d7e977449588824706
> >
> > I think we need to change the title of this bug.
>
> sorry, it's also unstable, but I can't reproduce error easily.

Opening a Firefox private window causes this error every time (built with --enable-webrender and --enable-rust-simd, not sure if it makes a difference).
(In reply to Mariusz Ceier from comment #9)
> (In reply to Yury Zhuravlev from comment #5)
> > > can you try to build mesa for previous commit? Like
> > > 6b3343e5d80abf162b45f0d7e977449588824706
> > >
> > > I think we need to change the title of this bug.
> >
> > sorry, it's also unstable, but I can't reproduce error easily.
>
> Opening firefox private window causes this error every time (built with
> --enable-webrender and --enable-rust-simd, not sure if it makes difference).

Currently, I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable for me. Can you check this one?
(I have no modern firefox yet)
(In reply to Yury Zhuravlev from comment #10)
> (In reply to Mariusz Ceier from comment #9)
> > (In reply to Yury Zhuravlev from comment #5)
> > > > can you try to build mesa for previous commit? Like
> > > > 6b3343e5d80abf162b45f0d7e977449588824706
> > > >
> > > > I think we need to change the title of this bug.
> > >
> > > sorry, it's also unstable, but I can't reproduce error easily.
> >
> > Opening firefox private window causes this error every time (built with
> > --enable-webrender and --enable-rust-simd, not sure if it makes difference).
>
> Currently, I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable
> for me. Can you check this one?
> (I have no modern firefox yet)

Just tried it and the error doesn't happen.
I get a similar bug when running knetwalk [1]. As soon as the application gets focus, there's visual corruption in its window. If I move the mouse away, the corruption (and the messages) are gone.

Running mesa-git built an hour ago on an RX 580. Will try to verify tomorrow which of the commits mentioned matter.

[1] https://kde.org/applications/games/knetwalk/

dmesg snippet:
[ 1642.706004] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0e08040c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706010] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BC1
[ 1642.706012] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E00400C
[ 1642.706016] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051585, read from 'TC1' (0x54433100) (4)
[ 1642.706074] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38440c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706078] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100B87
[ 1642.706080] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706082] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051527, read from 'TC5' (0x54433500) (68)
[ 1642.706087] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38480c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706089] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100B9D
[ 1642.706090] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706093] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051549, read from 'TC5' (0x54433500) (68)
[ 1642.706098] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c80c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706102] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BE2
[ 1642.706104] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706106] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051618, read from 'TC5' (0x54433500) (68)
[ 1642.706111] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c40c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706113] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100BD0
[ 1642.706115] amdgpu 0000:42:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E0C800C
https://bugs.freedesktop.org/show_bug.cgi?id=108824 appears to be related
78e35df52aa2f7d770f929a0866a0faa89c261a9 confirmed as the first bad commit for me. It was causing text corruption in Plasma Shell, and random visual corruption in Blender-git, KDE System Settings, VSCode. Immediate prior commit showed no issues at all.
Ah, also, recent LLVM master with RX580.
(In reply to kyle.devir from comment #15)
> Ah, also, recent LLVM master with RX580.

Probably something broke for Vega earlier, but I agree that after that commit everything became much worse.
How long did you test the commit before 78e35df52aa2f7d770f929a0866a0faa89c261a9?
A few seconds. The corruption was almost immediately visible to me. I could trigger the glitches with little effort. Prior commit is fine, though. I've been using 0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 without a single visual glitch happening, thus far. Vega 10 seems affected in a different way, for some reason.
(In reply to kyle.devir from comment #17)
> Vega 10 seems affected in a different way, for some reason.

It does indeed seem like that is the case. I have an RX580 like you, and for me it is enough to revert only commit 78e35df52aa2f7d770f929a0866a0faa89c261a9. Even later commits do not seem to cause any problems for me: I have been using Mesa (on git-28c2ce7105) with only that single commit patched out for two days without issues.
*** Bug 110717 has been marked as a duplicate of this bug. ***
There were several days when I didn't see this problem, but now I got it triggered once again. I.e. it seems to happen very rarely, so far only twice in 30 runs of Valley (done on different days / different graphics stack git versions).

=> It would be better for some fully reproducible case to be used as the main bug (e.g. the one from comment 9) instead of this one.
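To put a number on why such a rare reproducer makes a poor main bug: if 2 failures in 30 runs is taken as the per-run failure rate, one can estimate how many consecutive clean runs would be needed before a fix could be trusted. This is a back-of-the-envelope sketch only. It assumes runs are independent with a constant failure rate, which is doubtful here since the runs used different graphics stack versions, and the function name is made up.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that (1 - failure_rate)**n <= 1 - confidence.

    In other words: how many fault-free runs in a row are needed before the
    chance of seeing that streak on a still-broken stack drops below
    1 - confidence. Assumes independent runs with a constant failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

With the rate seen so far, clean_runs_needed(2 / 30) comes out at 44 Valley runs for 95% confidence, whereas a case that fails every time can confirm or reject a fix in one or two runs.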
Created attachment 144313
likely fix

This patch should fix it. Thanks to Pierre-Eric for inspiring it.
(In reply to Marek Olšák from comment #21)
> Created attachment 144313
> likely fix
>
> This patch should fix it. Thanks to Pierre-Eric for inspiring it.

I can confirm that this patch indeed seems to fix the issue for me. At least my testcases cannot reproduce it as easily with this patch as they could without it on my RX580. Hopefully it will fix the problems for the Vega owners as well.
Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
(In reply to Marek Olšák from comment #23)
> Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.

The original issue was about Vega, and on Vega we saw a different problem. I think somebody should check the patch on Vega before the issue is closed.
I will do it soon.
(In reply to Yury Zhuravlev from comment #24)
> (In reply to Marek Olšák from comment #23)
> > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
>
> The original issue was about Vega and on Vega we saw a different problem. I
> suppose before close issue somebody should check patch on Vega.
> I will do it soon.

Since nobody responded: On a Vega 64 I got GPU faults like the ones posted here followed by a GPU hang immediately when restoring a firefox (nightly) session. With mesa master this does not happen anymore.
(In reply to Christoph Haag from comment #25)
> (In reply to Yury Zhuravlev from comment #24)
> > (In reply to Marek Olšák from comment #23)
> > > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
> >
> > The original issue was about Vega and on Vega we saw a different problem. I
> > suppose before close issue somebody should check patch on Vega.
> > I will do it soon.
>
> Since nobody responded: On a Vega 64 I got GPU faults like the ones posted
> here followed by a GPU hang immediately when restoring a firefox (nightly)
> session. With mesa master this does not happen anymore.

I agree, everything is fine now on Vega 56.