Bug 110701

Summary: GPU faults in Unigine Valley 1.0
Product: Mesa Reporter: Eero Tamminen <eero.t.tamminen>
Component: Drivers/Gallium/radeonsi Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: cwidmer, gw.fossdev, kyle.devir, lonewolf, sarnex
Version: git   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=108824
Whiteboard:
i915 platform: i915 features:
Attachments: likely fix

Description Eero Tamminen 2019-05-17 11:58:27 UTC
Setup:
- FullHD monitor (through HDMI KVM)
- HadesCanyon KBL i7-8809G ([AMD/ATI] Vega [Radeon RX Vega M] (rev c0))
- Ubuntu 18.04
- drm-tip git kernel v5.1
- Last VegaM firmware from kernel git
- X server git version
- Unigine Valley 1.0
- Mesa:
OpenGL renderer string: AMD VEGAM (DRM 3.32.0, 5.1.0, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel (git-d65b160e6a)

Test-cases:
* Several runs of:
  bin/valley_x64 -project_name Valley -data_path ../ -engine_config ../data/valley_1.0.cfg -system_script valley/unigine.cpp -video_app opengl -sound_app null -video_mode -1 -video_fullscreen 1 -video_multisample 0 -video_width 1920 -video_height 1080 -extern_define ,BENCHMARK,RELEASE,LANGUAGE_EN,QUALITY_HIGH 

Expected outcome:
* No GPU issues (same as earlier, e.g. with yesterday's Mesa a9cef4f0e5)

Actual outcome:
* On last of 3 runs:
-----------------------------------------------
[  451.020091] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b618802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020093] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00044F6C
[  451.020093] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06188002
[  451.020095] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 282476, read from 'TC0' (0x54433000) (392)
[  451.020101] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b610402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020102] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  451.020102] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0608400C
[  451.020103] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32772) at page 0, read from 'TC5' (0x54433500) (132)
[  451.020109] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0b698402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020110] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000002
[  451.020110] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0618800C
[  451.020111] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32772) at page 2, read from 'TC0' (0x54433000) (392)
[  451.020468] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ac90802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020469] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  451.020470] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06104002
[  451.020471] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 0, read from 'TC3' (0x54433300) (260)
[  451.020476] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00108402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020477] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00046DD2
[  451.020477] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088002
[  451.020478] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 290258, read from 'TC4' (0x54433400) (136)
[  451.020484] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ad90402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020484] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000460A8
[  451.020485] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06008002
[  451.020486] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 286888, read from 'TC6' (0x54433600) (8)
[  451.020491] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0c380402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020492] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00046A0F
[  451.020493] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06104002
[  451.020494] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 289295, read from 'TC3' (0x54433300) (260)
[  451.020499] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ac98402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020500] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00045EAA
[  451.020500] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06108002
[  451.020501] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 286378, read from 'TC2' (0x54433200) (264)
[  451.020507] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0ed88802 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020507] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000454BD
[  451.020508] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06108002
[  451.020509] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 283837, read from 'TC2' (0x54433200) (264)
[  451.020514] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0be80402 for process valley_x64 pid 2048 thread valley_x64:cs0 pid 2066
[  451.020515] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0004732D
[  451.020516] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088002
[  451.020516] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) at page 291629, read from 'TC4' (0x54433400) (136)
[  451.025291] amdgpu 0000:01:00.0: IH ring buffer overflow (0x0008D420, 0x00006CE0, 0x0000D430)
[  451.521948] [drm] Fence fallback timer expired on ring sdma0
-----------------------------------------------

This is the only time I've seen this so far, so it may require a specific combination of Mesa and kernel versions.
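
For anyone triaging similar logs, the fault lines above can be cross-checked with a small script. This is only a sketch and is not part of amdgpu or Mesa: `parse_vm_fault` and its field names are hypothetical, and the regular expression is assumed to match only the "VM fault (...)" line format shown in this report. It illustrates two things that can be verified from the log itself: the decimal page number equals the hexadecimal VM_CONTEXT1_PROTECTION_FAULT_ADDR value (282476 == 0x00044F6C), and the 32-bit client value is the ASCII encoding of the quoted client name (0x54433000 is 'T', 'C', '0', NUL).

```python
import re

# Assumed format of the "VM fault" summary line, taken from the log in this report.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) "
    r"at page (\d+), (read|write) (?:from|to) '([^']*)' \((0x[0-9a-fA-F]+)\)"
)


def parse_vm_fault(line):
    """Return a dict describing one 'VM fault' line, or None if it does not match.

    Hypothetical helper for illustration; the field names are not kernel terms.
    """
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    status, vmid, pasid, page, access, client, client_raw = m.groups()
    page = int(page)
    raw = int(client_raw, 16)
    # The 32-bit client value appears to be a packed ASCII tag padded with NULs.
    decoded = raw.to_bytes(4, "big").rstrip(b"\0").decode("ascii", "replace")
    return {
        "status": int(status, 16),
        "vmid": int(vmid),
        "pasid": int(pasid),
        "page": page,
        # Same value as printed in VM_CONTEXT1_PROTECTION_FAULT_ADDR.
        "page_hex": f"0x{page:08X}",
        "access": access,
        "client": client,
        "client_decoded": decoded,
    }


if __name__ == "__main__":
    sample = (
        "[  451.020095] amdgpu 0000:01:00.0: VM fault (0x02, vmid 3, pasid 32772) "
        "at page 282476, read from 'TC0' (0x54433000) (392)"
    )
    print(parse_vm_fault(sample))
```

Run over a saved dmesg, a helper like this makes it easier to group faults by vmid, page, or client block when comparing reports.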
Comment 1 Christian Widmer 2019-05-18 00:22:26 UTC
I am having a similar issue: using OpenGL in certain applications (specifically Anki with hardware acceleration, and mpv with the x11egl context and vaapi hardware decoding) fills dmesg with GPU faults. I bisected Mesa and identified commit [1] as the culprit. Reverting it makes the issue disappear for me. Does that, by any chance, solve the issue for you as well?

[1] [78e35df52aa2f7d770f929a0866a0faa89c261a9] radeonsi: update buffer descriptors in all contexts after buffer invalidation
Comment 2 Christian Widmer 2019-05-18 00:37:27 UTC
I should probably mention that I am using a Radeon RX580 on Gentoo with kernel 5.1.3 and a pretty recent git version of LLVM.
Comment 3 Yury Zhuravlev 2019-05-18 01:48:02 UTC
Similar here: after today's update I get the same errors even with KDE KWin.
Looks like Marek changed something.

Error:
[  240.649210] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[  240.649211] amdgpu 0000:1f:00.0:   in page starting at address 0x0000800100be6000 from 27
[  240.649212] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  240.649215] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)
[  240.649216] amdgpu 0000:1f:00.0:   in page starting at address 0x0000800100bf3000 from 27
[  240.649217] amdgpu 0000:1f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  240.649220] amdgpu 0000:1f:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32769, for process systemsettings5 pid 12567 thread systemsett:cs0 pid 12571)

and so on.

My spec:
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.30.0, 5.1.0-gentoo, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel (git-28c2ce7105)
Comment 4 Yury Zhuravlev 2019-05-18 03:33:01 UTC
I bisected and found the bad commit: https://cgit.freedesktop.org/mesa/mesa/commit/?id=4549c3678865236216952f649fa5ed0115fe81b9

Can you try building Mesa at the previous commit, e.g. 6b3343e5d80abf162b45f0d7e977449588824706?

I think we need to change the title of this bug.
Comment 5 Yury Zhuravlev 2019-05-18 03:53:10 UTC
> can you try to build mesa for previous commit? Like
> 6b3343e5d80abf162b45f0d7e977449588824706 
> 
> I think we need to change the title of this bug.

Sorry, that one is also unstable, though I can't reproduce the error easily.
Anyway, try the commit before Marek's patches: d65b160e6a8712a33d72bea1a1b49587d483a18a
Comment 6 Yury Zhuravlev 2019-05-18 11:43:17 UTC
OK, so far I know the bug is somewhere in these 3 commits:
f3ae455eb08e8d718b828eb42f2529437916179b radeonsi: compute culling - flush CS to remove write references to buffers
0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 radeonsi: remove old_va parameter from si_rebind_buffer by remembering offsets
78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors in all contexts after buffer invalidation

I will test more. It looks like some later commit makes this bug easier to reproduce. It also exists before that, but occurs less often.
Comment 7 Christian Widmer 2019-05-19 04:15:39 UTC
(In reply to Yury Zhuravlev from comment #6)
> 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> in all contexts after buffer invalidation

That is the commit I identified in comment #1 as being responsible for my issues. I would not be surprised if reverting that one makes your faults disappear as well.
Comment 8 Yury Zhuravlev 2019-05-19 09:50:00 UTC
(In reply to Christian Widmer from comment #7)
> (In reply to Yury Zhuravlev from comment #6)
> > 78e35df52aa2f7d770f929a0866a0faa89c261a9 radeonsi: update buffer descriptors
> > in all contexts after buffer invalidation
> 
> That is the commit I identified in comment #1 as being responsible for my
> issues. I would not be surprised if reverting that one makes your faults
> disappear as well.

Unfortunately not. I have this issue even without that commit, though less severely.
Comment 9 Mariusz Ceier 2019-05-19 10:19:41 UTC
(In reply to Yury Zhuravlev from comment #5)
> > can you try to build mesa for previous commit? Like
> > 6b3343e5d80abf162b45f0d7e977449588824706 
> > 
> > I think we need to change the title of this bug.
> 
> sorry, it's also unstable, but I can't reproduce error easily. 

Opening a Firefox private window causes this error every time (Firefox built with --enable-webrender and --enable-rust-simd; not sure if that makes a difference).
Comment 10 Yury Zhuravlev 2019-05-19 15:52:14 UTC
(In reply to Mariusz Ceier from comment #9)
> (In reply to Yury Zhuravlev from comment #5)
> > > can you try to build mesa for previous commit? Like
> > > 6b3343e5d80abf162b45f0d7e977449588824706 
> > > 
> > > I think we need to change the title of this bug.
> > 
> > sorry, it's also unstable, but I can't reproduce error easily. 
> 
> Opening firefox private window causes this error every time (built with
> --enable-webrender and --enable-rust-simd, not sure if it makes difference).

Currently, I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable for me. Can you check this one? 
(I don't have a recent Firefox build yet.)
Comment 11 Mariusz Ceier 2019-05-19 17:48:43 UTC
(In reply to Yury Zhuravlev from comment #10)
> (In reply to Mariusz Ceier from comment #9)
> > (In reply to Yury Zhuravlev from comment #5)
> > > > can you try to build mesa for previous commit? Like
> > > > 6b3343e5d80abf162b45f0d7e977449588824706 
> > > > 
> > > > I think we need to change the title of this bug.
> > > 
> > > sorry, it's also unstable, but I can't reproduce error easily. 
> > 
> > Opening firefox private window causes this error every time (built with
> > --enable-webrender and --enable-rust-simd, not sure if it makes difference).
> 
> Currently, I am on 04122532e3c06260ae889a4f6a28d6f9849b00f5 and it's stable
> for me. Can you check this one? 
> (I have no modern firefox yet)

Just tried it and the error doesn't happen.
Comment 12 LoneVVolf 2019-05-19 23:19:49 UTC
I get a similar bug when running knetwalk[1]. As soon as the application gets focus, there's visual corruption in its window. If I move the mouse away, the corruption (and the messages) are gone.
Running mesa-git built an hour ago on an RX 580.
Will try to verify which of the commits mentioned matter tomorrow.


[1] https://kde.org/applications/games/knetwalk/


dmesg snippet
[ 1642.706004] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0e08040c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706010] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100BC1
[ 1642.706012] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E00400C
[ 1642.706016] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051585, read from 'TC1' (0x54433100) (4)
[ 1642.706074] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38440c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706078] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100B87
[ 1642.706080] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706082] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051527, read from 'TC5' (0x54433500) (68)
[ 1642.706087] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38480c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706089] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100B9D
[ 1642.706090] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706093] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051549, read from 'TC5' (0x54433500) (68)
[ 1642.706098] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c80c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706102] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100BE2
[ 1642.706104] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
[ 1642.706106] amdgpu 0000:42:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1051618, read from 'TC5' (0x54433500) (68)
[ 1642.706111] amdgpu 0000:42:00.0: GPU fault detected: 146 0x0c38c40c for process knetwalk pid 2647 thread knetwalk:cs0 pid 2656
[ 1642.706113] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100BD0
[ 1642.706115] amdgpu 0000:42:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E0C800C
Comment 13 LoneVVolf 2019-05-19 23:30:41 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=108824 appears to be related
Comment 14 Kyle De'Vir 2019-05-20 00:24:02 UTC
78e35df52aa2f7d770f929a0866a0faa89c261a9 confirmed as the first bad commit for me.

It was causing text corruption in Plasma Shell, and random visual corruption in Blender-git, KDE System Settings, VSCode.

Immediate prior commit showed no issues at all.
Comment 15 Kyle De'Vir 2019-05-20 00:27:08 UTC
Ah, also, recent LLVM master with RX580.
Comment 16 Yury Zhuravlev 2019-05-20 03:20:41 UTC
(In reply to kyle.devir from comment #15)
> Ah, also, recent LLVM master with RX580.

Probably something broke for Vega earlier, but I agree that after that commit everything became much worse.
How long did you test the commit before 78e35df52aa2f7d770f929a0866a0faa89c261a9?
Comment 17 Kyle De'Vir 2019-05-20 03:38:22 UTC
A few seconds.

The corruption was almost immediately visible to me. I could trigger the glitches with little effort.

Prior commit is fine, though. I've been using 0f1b070bad34c46c4bcc6c679fa533bf6b4b79e5 without a single visual glitch happening, thus far.

Vega 10 seems affected in a different way, for some reason.
Comment 18 Christian Widmer 2019-05-20 03:46:53 UTC
(In reply to kyle.devir from comment #17)
> Vega 10 seems affected in a different way, for some reason.

That does indeed seem to be the case. I have an RX580 like you, and for me it is enough to revert only commit 78e35df52aa2f7d770f929a0866a0faa89c261a9. Even later commits do not seem to cause any problems for me: I have been using mesa (on git-28c2ce7105) with only that single commit patched out for two days without issues.
Comment 19 Gert Wollny 2019-05-21 07:08:08 UTC
*** Bug 110717 has been marked as a duplicate of this bug. ***
Comment 20 Eero Tamminen 2019-05-21 09:44:13 UTC
There were several days when I didn't see this problem, but now I have triggered it once again. I.e. it seems to happen very rarely, so far only twice in 30 runs of Valley (done on different days / different graphics stack git versions) => It would be better for some fully reproducible case to be used as the main bug (e.g. the one from comment 9) instead of this one.
Comment 21 Marek Olšák 2019-05-21 18:36:19 UTC
Created attachment 144313
likely fix

This patch should fix it. Thanks to Pierre-Eric for inspiring it.
Comment 22 Christian Widmer 2019-05-21 21:38:17 UTC
(In reply to Marek Olšák from comment #21)
> Created attachment 144313
> likely fix
> 
> This patch should fix it. Thanks to Pierre-Eric for inspiring it.

I can confirm that this patch indeed seems to fix the issue for me. At least my testcases cannot reproduce it as easily with this patch as they could without it on my RX580. Hopefully it will fix the problems for the Vega owners as well.
Comment 23 Marek Olšák 2019-05-21 23:19:05 UTC
Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
Comment 24 Yury Zhuravlev 2019-05-22 00:19:51 UTC
(In reply to Marek Olšák from comment #23)
> Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.

The original issue was about Vega, and on Vega we saw a different problem. I suppose somebody should check the patch on Vega before the issue is closed.
I will do it soon.
Comment 25 Christoph Haag 2019-05-23 20:01:33 UTC
(In reply to Yury Zhuravlev from comment #24)
> (In reply to Marek Olšák from comment #23)
> > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
> 
> The original issue was about Vega and on Vega we saw a different problem. I
> suppose before close issue somebody should check patch on Vega. 
> I will do it soon.

Since nobody responded: On a Vega 64 I got GPU faults like the ones posted here followed by a GPU hang immediately when restoring a firefox (nightly) session. With mesa master this does not happen anymore.
Comment 26 Yury Zhuravlev 2019-05-24 00:08:15 UTC
(In reply to Christoph Haag from comment #25)
> (In reply to Yury Zhuravlev from comment #24)
> > (In reply to Marek Olšák from comment #23)
> > > Fixed by d6053bf2a170a0fec6d232fda097d2f35f0e9eae. Closing.
> > 
> > The original issue was about Vega and on Vega we saw a different problem. I
> > suppose before close issue somebody should check patch on Vega. 
> > I will do it soon.
> 
> Since nobody responded: On a Vega 64 I got GPU faults like the ones posted
> here followed by a GPU hang immediately when restoring a firefox (nightly)
> session. With mesa master this does not happen anymore.

I agree, everything is fine now (Vega 56).