Description
deltasquared
2019-07-27 13:12:20 UTC
Created attachment 144882 [details]
API trace that can reliably cause GPU protection faults on a ryzen 2200G
The aforementioned "dodgy" apitrace trace file.
Created attachment 144883 [details]
apitrace replay --verbose --debug: stdout
NB: stderr attached separately. Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty. I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not. I lack the knowledge to spot which particular call is the bad one though.
Created attachment 144884 [details]
apitrace replay --verbose --debug: stderr
stderr of the same as above.
I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.
Oh dear, it seems I'm getting in a bit of a muddle with the attachments, please bear with.
Created attachment 144885 [details]
apitrace replay --verbose --debug: stdout
NB: stderr attached separately. Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty. I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not. I lack the knowledge to spot which particular call is the bad one though.
Created attachment 144886 [details]
apitrace replay --verbose --debug: stderr
stderr of the same as above.
I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.
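For anyone re-running the replay, here is a minimal sketch of capturing stdout and stderr in one file so the stack trace can be placed relative to the call log. It assumes `apitrace` is on the PATH and reuses the flags quoted above; the function names and the filenames in the usage comment are hypothetical.

```python
import subprocess

def replay_command(trace_path):
    """Build the replay invocation with the flags used for the attachments."""
    return ["apitrace", "replay", "--verbose", "--debug", trace_path]

def replay_to_single_log(trace_path, log_path):
    """Run the replay with stderr folded into stdout, both written to log_path."""
    with open(log_path, "wb") as log:
        # stderr=STDOUT points both streams at the same file, so output lands
        # in the order the child process actually flushes it.
        result = subprocess.run(
            replay_command(trace_path),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=False,  # a crash or non-zero exit is the expected outcome here
        )
    return result.returncode

# Example (hypothetical filenames):
# replay_to_single_log("minetest.trace", "replay-combined.log")
```

Note that a program may buffer stdout differently when it is written to a file rather than a terminal, so the interleaving is only as exact as the child's own flushing allows.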
Created attachment 144887 [details]
dmesg log from boot to running apitrace replay on the above apitrace trace file
Notably there are a lot more "VM_L2_PROTECTION_FAULT_STATUS: ..." messages when replaying (this file) vs the original dmesg output (when I was able to hit the bug playing the game itself) in the main bug description.
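To put a number on "a lot more", here is a minimal sketch that counts the lines mentioning `VM_L2_PROTECTION_FAULT_STATUS` in a saved dmesg log. It only does a substring match and assumes nothing else about the amdgpu log format; the function names are hypothetical.

```python
FAULT_MARKER = "VM_L2_PROTECTION_FAULT_STATUS"

def count_fault_lines(lines):
    """Count log lines that mention the protection fault status message."""
    return sum(1 for line in lines if FAULT_MARKER in line)

def count_faults_in_file(path):
    """Count fault lines in a saved dmesg log file."""
    # errors="replace" keeps reading if the log contains stray non-UTF-8 bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_fault_lines(log)
```

Running `count_faults_in_file` on each of the two attached dmesg logs would give directly comparable counts.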
Created attachment 144888 [details]
dmesg output from boot to stopping dmesg when killing xorg was possible
In this case I was able to kill xorg and return to the linux console. When this happens the protection faults continue in dmesg but the pid and thread id values go to zero, not sure if this is significant.
This particular dmesg output accompanies the attached apitrace stdout/stderr files from that replay run.
Created attachment 144889 [details]
Observed graphical corruption - left hand side of monitor
Taken in two photos as my screen's a rather large one.
This graphical corruption appears along the edges of some objects, which can *sometimes* occur either when running minetest directly and loading a world or replaying the above api trace.
However, I notice that sometimes no graphical corruption occurs whatsoever but everything still freezes. That's on top of the fact that the freeze itself doesn't happen all the time... suggests something highly non-deterministic at play?
Sometimes it flickers to a similarly corrupted version of the minetest logo before the freeze, haven't caught that on camera yet.
Created attachment 144890 [details]
Observed graphical corruption - right hand side of monitor
Other side of above.
The sky colour is otherwise undisturbed to the top edge of the monitor, hence why the top edge was not in shot. I doubt this camera would have picked up enough detail otherwise - again it's a fairly large monitor.
Some additional information I had neglected to mention in the initial description in the "excitement" of filing my first bug here...
Relevant hardware is as stated a ryzen 2200G running solely on integrated vega graphics - I haven't mentioned any other specs of the system as this bug has persisted across replacements of all components, even the motherboard - the only things that have not changed are the PSU, nvme storage and the ryzen chip itself.
Distro is arch linux with all packages up to date at the time of writing.
Kernel version 5.2.1-arch1-1-ARCH.
Mesa built-from-git version mentioned in bug description.
LLVM version 8.0.1.
Any other information is available on request.
Thanks for the bug report.
I could reproduce the bug using the provided apitrace, both on a Ryzen platform and on a Vega Mobile laptop (can't reproduce on Navi).
Using MESA_DEBUG=flush or AMD_DEBUG=check_vm seems to make the problem go away, so my guess would be a synchronization / cache issue, but I haven't found the root cause yet.
Using AMD_DEBUG=nodpbb "fixes" the problem.
The apitrace no longer causes issues on my system either if I use AMD_DEBUG=nodpbb.
I also decided to try this on minetest and *so far* (bearing in mind the issue was non-deterministic in the first place, so a decisive ruling is near impossible) I have not re-incurred a crash. Interestingly, what I have noticed is that sometimes when minetest did not lock up my system before, the loading bar would suffer mild graphical corruption (bits of the black border go white) - quite difficult to capture on camera due to being so fleeting. So far with nodpbb I have yet to observe these artefacts again.
I did try launching a minetest world with AMD_DEBUG=check_vm instead, however I somehow still managed to get a lock-up that way with similar graphical corruption as the bug description.
Alas it seems my btrfs root decided to eat my dmesg log file when I had to force power off, so I'm unable to see if it was the dreaded VM_L2_PROTECTION_FAULT again >:(
Could you test the branch from MR https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1554 and let me know if it fixes the issue for you?
The MR has been merged. Thanks for your help!
Apologies for being late to reply. Having run mesa built from the MR branch, I have since been unable to get the same crash when running minetest. Certainly the apitrace capture can no longer bring my system down, however the actual program running was always less deterministic than that, so it was hard to prove the absence of the bug - that said, I have been playing the game again for a few days now and have not experienced the crash, so I feel reasonably comfortable it has gone.
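For anyone stuck on a Mesa build without the MR, the AMD_DEBUG=nodpbb workaround discussed above amounts to setting one environment variable for the affected process. Here is a minimal sketch, assuming the program to launch (for example `minetest`) is on the PATH; the helper names are hypothetical.

```python
import os
import subprocess

def env_with_amd_debug(flag, base_env=None):
    """Return a copy of the environment with AMD_DEBUG set to the given flag."""
    env = dict(os.environ if base_env is None else base_env)
    env["AMD_DEBUG"] = flag
    return env

def launch_with_workaround(argv, flag="nodpbb"):
    """Start a program with AMD_DEBUG set only for that child process."""
    return subprocess.run(argv, env=env_with_amd_debug(flag), check=False)

# Example: launch_with_workaround(["minetest"])
```

Setting the variable per process, rather than globally, keeps the workaround from affecting other applications.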