111231 – random VM_L2_PROTECTION_FAULTs when loading a world in minetest on AMD ryzen 2200G integrated graphics

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Reported:	2019-07-27 13:12 UTC by deltasquared
Modified:	2019-08-10 13:18 UTC
CC List:	1 user

Attachments
API trace that can reliably cause GPU protection faults on a ryzen 2200G (23.62 MB, application/octet-stream) 2019-07-27 13:15 UTC, deltasquared
apitrace replay --verbose --debug: stdout (112 bytes, text/plain) 2019-07-27 13:18 UTC, deltasquared
apitrace replay --verbose --debug: stderr (8.05 KB, text/plain) 2019-07-27 13:20 UTC, deltasquared
apitrace replay --verbose --debug: stdout (17.64 MB, text/plain) 2019-07-27 13:24 UTC, deltasquared
apitrace replay --verbose --debug: stderr (11.34 KB, text/plain) 2019-07-27 13:26 UTC, deltasquared
dmesg log from boot to running apitrace replay on the above apitrace trace file (137.97 KB, text/plain) 2019-07-27 13:30 UTC, deltasquared
dmesg output from boot to stopping dmesg when killing xorg was possible (288.37 KB, text/plain) 2019-07-27 13:34 UTC, deltasquared
Observed graphical corruption - left hand side of monitor (1.26 MB, image/jpeg) 2019-07-27 13:49 UTC, deltasquared
Observed graphical corruption - right hand side of monitor (941.21 KB, image/jpeg) 2019-07-27 13:50 UTC, deltasquared

Description deltasquared 2019-07-27 13:12:20 UTC

When playing minetest on an AMD ryzen 2200G with vega integrated graphics, occasionally the system will appear to suffer a graphics lock-up during game load when the loading bar appears.
When this occurs, dmesg spits out a VM_L2_PROTECTION_FAULT and then repeated errors about fence timeouts:

[ 5699.136659] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0 ring:155 vmid:5 pasid:32770, for process minetest pid 7127 thread minetest:cs0 pid 7133)
[ 5699.136662] amdgpu 0000:0b:00.0:   in page starting at address 0x000080014034d000 from 27
[ 5699.136664] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501136
[ 5704.343299] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5709.259775] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5709.259860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5709.259862] [drm] GPU recovery disabled.
[ 5709.463238] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5719.286451] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5719.286537] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5719.286539] [drm] GPU recovery disabled.
[ 5729.312836] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5729.312921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5729.312923] [drm] GPU recovery disabled.
[ 5739.339485] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5739.339570] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5739.339572] [drm] GPU recovery disabled.
[ 5749.366552] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5749.366637] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5749.366640] [drm] GPU recovery disabled.

Notably, when playing minetest normally, this doesn't always happen, but when it does the screen gets a light covering of graphical corruption "confetti" (photos to follow - had to be taken on a phone, sorry).
Currently running a mesa debug build compiled from git at commit b0626c1f306 after seeing if https://bugs.freedesktop.org/show_bug.cgi?id=105251 had anything to do with it - I think this is related but not entirely a duplicate, as a fix mentioned there did stop the test program there from having an effect but did not stop this problem.

In the course of trying to reproduce this problem in a more repeatable manner, I decided to take an apitrace (will attach in following messages). Interestingly, the brief trace I took did not crash my system during recording of it, but now replaying it will fairly regularly cause the same kind of lockup, more frequently than the game itself will.
I ran apitrace replay in verbose mode to see where it stopped, in case that gave an approximate indication of where things started going pear-shaped. The point at which output ends is well short of the entire apitrace dump, as expected from what I saw - and additionally the stderr appears to contain an exception of some kind. See the apitrace.out.txt and apitrace.err.txt attachments (to follow separately).

I haven't yet got a dmesg output from running minetest itself, but I have got some runs (spanning from boot to either hard or soft reboot - sometimes xorg was killable, other times not) from replaying the offending api trace. These will also be attached in follow-up messages.
These appear to have a lot more GPU faults before the messages about timeouts appear.

Comment 1 deltasquared 2019-07-27 13:15:14 UTC

Created attachment 144882 [details]
API trace that can reliably cause GPU protection faults on a ryzen 2200G

The aforementioned "dodgy" apitrace trace file.

Comment 2 deltasquared 2019-07-27 13:18:50 UTC

Created attachment 144883 [details]
apitrace replay --verbose --debug: stdout

NB: stderr attached separately.
Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty.

I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not.
I lack the knowledge to spot which particular call is the bad one though.

Comment 3 deltasquared 2019-07-27 13:20:28 UTC

Created attachment 144884 [details]
apitrace replay --verbose --debug: stderr

stderr of the same as above.
I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.
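
For reference, a minimal sketch of how such a combined re-run could be captured, assuming Python is available. The helper name run_merged and the log file name are hypothetical; only the apitrace replay --verbose --debug invocation comes from this report. Note that interleaving is only approximate if the replayed program buffers its stdout.

```python
import subprocess


def run_merged(cmd, log_path):
    """Run cmd with stdout and stderr written to the same log file.

    Both streams share one file handle, so their lines land in a single
    file roughly in the order the child process writes them.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Hypothetical usage (the trace file name is an assumption):
# run_merged(["apitrace", "replay", "--verbose", "--debug", "minetest.trace"],
#            "replay-combined.log")
```

The return code is passed back so a caller can tell a clean exit from a crash of the replay process.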

Comment 4 deltasquared 2019-07-27 13:22:39 UTC

Oh dear, it seems I'm getting in a bit of a muddle with the attachments, please bear with.

Comment 5 deltasquared 2019-07-27 13:24:50 UTC

Created attachment 144885 [details]
apitrace replay --verbose --debug: stdout

NB: stderr attached separately.
Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty.

I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not.
I lack the knowledge to spot which particular call is the bad one though.

Comment 6 deltasquared 2019-07-27 13:26:10 UTC

Created attachment 144886 [details]
apitrace replay --verbose --debug: stderr

stderr of the same as above.
I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.

Comment 7 deltasquared 2019-07-27 13:30:23 UTC

Created attachment 144887 [details]
dmesg log from boot to running apitrace replay on the above apitrace trace file

Notably there are a lot more "VM_L2_PROTECTION_FAULT_STATUS: ..." messages when replaying (this file) vs the original dmesg output (when I was able to hit the bug playing the game itself) in the main bug description.
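
To make the "a lot more" comparison concrete, the two kinds of message can be counted in a saved dmesg log. This is a rough sketch only: the function name summarize_dmesg is hypothetical, and the patterns are assumptions based on the message formats quoted in the bug description, not on the amdgpu source.

```python
import re
from collections import Counter

# Patterns modelled on the dmesg lines quoted in the bug description.
FAULT_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:\s*(0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"ring (\w+) timeout")


def summarize_dmesg(text):
    """Count protection faults by status value and ring timeouts by ring name."""
    faults = Counter()
    timeouts = Counter()
    for line in text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            faults[fault.group(1)] += 1
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            timeouts[timeout.group(1)] += 1
    return faults, timeouts


# Three lines copied from the description, used here as sample input.
SAMPLE = """\
[ 5699.136664] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501136
[ 5709.259775] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5719.286451] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
"""

faults, timeouts = summarize_dmesg(SAMPLE)
print(dict(faults), dict(timeouts))
```

To compare two runs, pass the full contents of each attached dmesg file to summarize_dmesg and compare the totals.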

Comment 8 deltasquared 2019-07-27 13:34:23 UTC

Created attachment 144888 [details]
dmesg output from boot to stopping dmesg when killing xorg was possible

In this case I was able to kill xorg and return to the linux console. When this happens the protection faults continue in dmesg but the pid and thread id values go to zero, not sure if this is significant.
This particular dmesg output accompanies the attached apitrace stdout/stderr files from that replay run.

Comment 9 deltasquared 2019-07-27 13:49:00 UTC

Created attachment 144889 [details]
Observed graphical corruption - left hand side of monitor

Taken in two photos as my screen's a rather large one.
This graphical corruption appears along the edges of some objects, which can *sometimes* occur either when running minetest directly and loading a world or replaying the above api trace.
However, I notice that sometimes no graphical corruption occurs whatsoever but everything still freezes. That's on top of the fact that the freeze itself doesn't happen all the time... suggests something highly non-deterministic at play?
Sometimes it flickers to a similarly corrupted version of the minetest logo before the freeze, haven't caught that on camera yet.

Comment 10 deltasquared 2019-07-27 13:50:21 UTC

Created attachment 144890 [details]
Observed graphical corruption - right hand side of monitor

Other side of above.
The sky colour is otherwise undisturbed to the top edge of the monitor, hence why the top edge was not in shot. I doubt this camera would have picked up enough detail otherwise - again it's a fairly large monitor.

Comment 11 deltasquared 2019-07-27 13:57:28 UTC

Some additional information I had neglected to mention in the initial description in the "excitement" of filing my first bug here...

Relevant hardware is as stated a ryzen 2200G running solely on integrated vega graphics - I haven't mentioned any other specs of the system as this bug has persisted across replacements of all components, even the motherboard - only things that have not changed are the PSU, nvme storage and the ryzen chip itself.

Distro is arch linux with all packages up to date at the time of writing. Kernel version 5.2.1-arch1-1-ARCH. Mesa built-from-git version mentioned in bug description. LLVM version 8.0.1.

Any other information is available on request.

Comment 12 Pierre-Eric Pelloux-Prayer 2019-07-29 14:11:11 UTC

Thanks for the bug report.

I could reproduce the bug using the provided apitrace, both on a Ryzen platform and on a Vega Mobile laptop (can't reproduce on Navi).

Using MESA_DEBUG=flush or AMD_DEBUG=check_vm seems to make the problem go away, so my guess would be a synchronization / cache issue, but I haven't found the root cause yet.

Comment 13 Pierre-Eric Pelloux-Prayer 2019-07-29 15:56:40 UTC

Using AMD_DEBUG=nodpbb "fixes" the problem.

Comment 14 deltasquared 2019-07-29 19:15:56 UTC

The apitrace no longer causes issues on my system either if I use AMD_DEBUG=nodpbb. I also decided to try this on minetest and *so far* (bearing in mind the issue was non-deterministic in the first place, so a decisive ruling is near impossible) I have not re-incurred a crash.

Interestingly, what I have noticed is that sometimes when minetest did not lock up my system before, the loading bar would suffer mild graphical corruption (bits of the black border go white) - quite difficult to capture on camera due to being so fleeting. So far with nodpbb I have yet to observe these artefacts again.

I did try launching a minetest world with AMD_DEBUG=check_vm instead, however I somehow still managed to get a lock-up that way with similar graphical corruption as the bug description. Alas it seems my btrfs root decided to eat my dmesg log file when I had to force power off, so unable to see if it was the dreaded VM_L2_PROTECTION_FAULT again >:(
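
For anyone wanting to repeat these runs, here is a small sketch of launching a program with the debug variable set, assuming Python is available. The helper env_with_debug is hypothetical; the variable name AMD_DEBUG and the flags nodpbb and check_vm are the ones mentioned in this thread. Joining several flags with commas is an assumption about how Mesa parses the variable.

```python
import os


def env_with_debug(flags, base=None, var="AMD_DEBUG"):
    """Return a copy of the environment with the given flags added to var.

    Flags already present in var are kept and not duplicated.
    """
    env = dict(os.environ if base is None else base)
    current = [flag for flag in env.get(var, "").split(",") if flag]
    for flag in flags:
        if flag not in current:
            current.append(flag)
    env[var] = ",".join(current)
    return env


# Hypothetical usage:
# import subprocess
# subprocess.run(["minetest"], env=env_with_debug(["nodpbb"]))
```

Building the environment as a copy leaves the calling shell untouched, so runs with and without the flag can be compared back to back.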

Comment 15 Pierre-Eric Pelloux-Prayer 2019-08-05 13:16:37 UTC

Could you test the branch from MR https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1554 and let me know if it fixes the issue for you?

Comment 16 Pierre-Eric Pelloux-Prayer 2019-08-08 13:54:30 UTC

The MR has been merged.

Thanks for your help!

Comment 17 deltasquared 2019-08-10 13:18:20 UTC

Apologies for being late to reply.
Having run mesa built from the MR branch, I have since been unable to get the same crash when running minetest.
Certainly the apitrace capture can no longer bring my system down. The game itself was always less deterministic than the trace, so the absence of the bug is hard to prove - that said, I have been playing the game again for a few days now and have not experienced the crash, so I feel reasonably comfortable it has gone.
