Bug 109819 - [APITRACE] Shadow of Mordor causes gpu freeze ryzen 2200g
Summary: [APITRACE] Shadow of Mordor causes gpu freeze ryzen 2200g
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL: https://letz.tw/ShadowOfMordor.trace.xz
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-03 17:09 UTC by Dominic
Modified: 2019-11-19 09:15 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
- dumps from dmesg, glxinfo and xorg (39.77 KB, application/zip), 2019-03-03 17:09 UTC, Dominic
- New 5.0 Kernel Crashlog (67.86 KB, application/zip), 2019-03-04 17:35 UTC, Dominic
- Photo of apitrace replay after freeze (341.65 KB, image/jpeg), 2019-03-08 00:49 UTC, Dominic
- Vertex & Fragment Shader per apitrace just before the crash (7.03 KB, text/plain), 2019-03-10 17:40 UTC, Dominic
- UMR dump (458.24 KB, text/plain), 2019-03-10 17:43 UTC, Dominic
- sudo umr -lb sudo umr -R gfx[.] sudo umr -R sdma0[.] sudo umr -R sdma1[.] (361.84 KB, text/plain), 2019-03-10 17:45 UTC, Dominic
- Attached screenshot of mentioned apitrace line 14840194 (last line) (136.68 KB, image/png), 2019-03-10 18:06 UTC, Dominic

Description Dominic 2019-03-03 17:09:02 UTC
Created attachment 143516
dumps from dmesg, glxinfo and xorg

With kernel 4.20.13 (on Ubuntu 18.04.2), Shadow of Mordor (installed from Steam) freezes the screen after anywhere from 5 to 120 minutes of play. SSH'ing into the machine still works.

I'm attaching the glxinfo output, Xorg.log, and the dmesg log from the crash for reference.

Btw, I've added "drm.debug=0x1e log_buf_len=1M" to the kernel command line in GRUB, but so far I haven't been able to catch anything being written to /sys/class/drm/card0/error.
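For reference, drm.debug is a bitmask of DRM debug categories, and log_buf_len=1M enlarges the kernel log buffer to 1 MiB so fewer messages are dropped. The helper below is a minimal sketch for decoding such a mask; the function name is hypothetical, and the bit assignments are assumed from include/drm/drm_print.h as of roughly Linux 4.20/5.0.

```sh
#!/bin/sh
# Decode a drm.debug bitmask into DRM debug category names.
# Assumed bit order (lowest bit first), per drm_print.h around 4.20/5.0:
# CORE=0x01 DRIVER=0x02 KMS=0x04 PRIME=0x08 ATOMIC=0x10
# VBL=0x20 STATE=0x40 LEASE=0x80 DP=0x100
decode_drm_debug() {
  mask=$(( $1 ))   # arithmetic expansion accepts hex such as 0x1e
  bit=1
  out=""
  for name in CORE DRIVER KMS PRIME ATOMIC VBL STATE LEASE DP; do
    if [ $(( mask & bit )) -ne 0 ]; then
      out="${out:+$out }$name"   # append with a separating space
    fi
    bit=$(( bit << 1 ))
  done
  echo "$out"
}

decode_drm_debug 0x1e   # prints: DRIVER KMS PRIME ATOMIC
```

DRIVER, KMS and ATOMIC messages are emitted during routine display activity, so a mask like 0x1e can be expected to produce a steady stream of log output even when nothing is wrong.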


Let me know if there is anything I can do to help debugging.
Comment 1 Dominic 2019-03-04 17:33:34 UTC
Following the Linux 5.0 release, here is an updated report using that kernel and the current HEAD of git://anongit.freedesktop.org/mesa/drm.

With Linux 5.0, the dmesg log is full of repeating amdgpu messages that appear regardless of what the machine is doing. I'm not sure whether they are caused by the debug parameters on the GRUB line.

The freeze itself still looks the same, but the error message in dmesg has changed. At the freeze point it now shows only the following lines:

[ 2501.329358] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=300047, emitted seq=300049
[ 2501.329419] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ShadowOfMordor pid 3150 thread ShadowOfMo:cs0 pid 3152
[ 2501.329421] [drm] GPU recovery disabled.
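The two sequence numbers in the timeout message can be read as fence counters for the gfx ring: "emitted" is the last submission queued to the ring and "signaled" is the last one the GPU reported as finished, so their difference is the number of jobs still outstanding when the timeout fired. A minimal sketch that extracts them from the log line (the helper name pending_jobs is hypothetical):

```sh
#!/bin/sh
# Report how many jobs were still pending on a ring, given an amdgpu
# "ring ... timeout, signaled seq=N, emitted seq=M" log line.
pending_jobs() {
  signaled=$(printf '%s\n' "$1" | sed -n 's/.*signaled seq=\([0-9][0-9]*\).*/\1/p')
  emitted=$(printf '%s\n' "$1" | sed -n 's/.*emitted seq=\([0-9][0-9]*\).*/\1/p')
  echo $(( emitted - signaled ))
}

log='[ 2501.329358] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=300047, emitted seq=300049'
pending_jobs "$log"   # prints: 2
```

Here two jobs were stuck on the gfx ring. The "GPU recovery disabled" line means the driver did not attempt a reset; the amdgpu.gpu_recovery=1 kernel parameter asks it to try one, although whether a reset would succeed on this APU is not established in this report.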

The full dmesg log and glxinfo output are in the newly attached dumps_from_dmesg_and_glxinfo_2.
Comment 2 Dominic 2019-03-04 17:35:05 UTC
Created attachment 143522
New 5.0 Kernel Crashlog
Comment 3 Dominic 2019-03-08 00:29:50 UTC
I've captured an apitrace and can reproduce the issue every time by replaying it with "apitrace replay ShadowOfMordor.trace". It's quite big (10 GB, xz-compressed), but here it is: https://letz.tw/ShadowOfMordor.trace.xz
Comment 4 Dominic 2019-03-08 00:49:49 UTC
Created attachment 143579
Photo of apitrace replay after freeze

I've added a photo of the apitrace replay running in verbose mode, to show the last calls printed before the freeze. The last visible call is 14838798.
Comment 5 Dominic 2019-03-08 18:24:53 UTC
Fun fact: by binary-searching the trace, replaying up to different call numbers, I was able to identify that my GPU hangs every time on this call:

14840194 @5 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 60, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 0) 

I can safely replay up to the previous call, 14840193, but replaying up to 14840194 freezes every time.
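The binary search over call numbers can be sketched as a small shell function. Everything here is illustrative: first_bad_call and ask_replay_ok are hypothetical helpers, and how to replay a trace only up to call N is left to the predicate. Because a real hang freezes the display and usually needs a reboot, the example predicate simply asks the person at the keyboard for the result.

```sh
#!/bin/sh
# Binary search for the first call number at which a replay hangs.
# Invariant: replaying up to $good is fine, replaying up to $bad hangs.
# The predicate is run as "PREDICATE N" and must exit 0 if replaying
# up to call N completes without a hang.
first_bad_call() {
  good=$1
  bad=$2
  shift 2
  while [ $(( bad - good )) -gt 1 ]; do
    mid=$(( (good + bad) / 2 ))
    if "$@" "$mid"; then
      good=$mid
    else
      bad=$mid
    fi
  done
  echo "$bad"
}

# Hypothetical interactive predicate: the user replays up to call $1
# by whatever means is available and answers y or n.
ask_replay_ok() {
  printf 'Replay up to call %s without a hang? [y/n] ' "$1" >&2
  read -r answer
  case $answer in
    y*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `first_bad_call 0 20000000 ask_replay_ok`, where 20000000 stands in for any call number already known to hang (such as the end of the trace). A range of 20 million calls is narrowed down to a single call in at most 25 replays.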

Checking the OpenGL docs, I'm not sure whether indices and basevertex are allowed to be NULL and 0. Could that be the issue?
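Per the GL specification, NULL and 0 are ordinary values here: when a buffer object is bound to GL_ELEMENT_ARRAY_BUFFER, `indices` is interpreted as a byte offset into that buffer, so NULL means offset 0, and `basevertex = 0` makes the call behave like plain glDrawElements. Assuming an index buffer was bound at that point (the quoted call alone does not show this), the sketch below works out which bytes of it the draw reads; the helper name index_byte_range is hypothetical.

```sh
#!/bin/sh
# Byte range of the bound GL_ELEMENT_ARRAY_BUFFER read by a
# glDrawElements*-style call, given the offset (the "indices" value),
# the index count, and the index type.
index_byte_range() {
  offset=$1
  count=$2
  type=$3
  case $type in
    GL_UNSIGNED_BYTE)  size=1 ;;
    GL_UNSIGNED_SHORT) size=2 ;;
    GL_UNSIGNED_INT)   size=4 ;;
    *) echo "unknown index type: $type" >&2; return 1 ;;
  esac
  echo "$offset-$(( offset + count * size - 1 ))"
}

# Call 14840194: indices = NULL (offset 0), count = 60, GL_UNSIGNED_SHORT
index_byte_range 0 60 GL_UNSIGNED_SHORT   # prints: 0-119
```

So the draw reads 60 two-byte indices (20 triangles) from bytes 0 to 119 of the index buffer, which is an unremarkable request on its own.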
Comment 6 Dominic 2019-03-10 17:40:22 UTC
Created attachment 143612
Vertex & Fragment Shader per apitrace just before the crash

I've done some more software updates:
- Kernel 5.0.1
- Mesa 1.8.4

But the crash still happens at the very same OpenGL call.

So the glDrawElementsBaseVertex() call mentioned above is definitely where the hang occurs, but the state that makes the GPU freeze seems to have been set up by earlier calls. Further testing showed that the preceding glUseProgram() call appears to be required to trigger the hang, so I've attached the vertex and fragment shaders as shown in apitrace.
Comment 7 Dominic 2019-03-10 17:43:24 UTC
Created attachment 143613
UMR dump

Additionally, having seen UMR used in another bug (https://bugs.freedesktop.org/show_bug.cgi?id=102322), I've attached the output of: sudo umr -O verbose -R gfx[.] &> umr-verbose-mar11.txt
Comment 8 Dominic 2019-03-10 17:45:58 UTC
Created attachment 143614
sudo umr -lb sudo umr -R gfx[.] sudo umr -R sdma0[.] sudo umr -R sdma1[.]

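I've also attached the output of this script, run as ./run.sh &> umr-mar11.txt:

#!/bin/bash
set -x
# list the hardware IP blocks umr can access
sudo umr -lb
# dump the gfx ring and both SDMA rings; the quotes keep the shell
# from treating [.] as a glob pattern
sudo umr -R 'gfx[.]'
sudo umr -R 'sdma0[.]'
sudo umr -R 'sdma1[.]'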
Comment 9 Dominic 2019-03-10 18:06:47 UTC
Created attachment 143615
Attached screenshot of mentioned apitrace line 14840194 (last line)
Comment 10 Pierre-Eric Pelloux-Prayer 2019-09-16 07:50:20 UTC
Using a recent kernel and Mesa master, I could replay the trace 3 times without getting a GPU hang.

Can you still reproduce the problem?
Comment 11 Dominic 2019-09-16 09:54:20 UTC
I'm travelling right now, but can check once home again.
Comment 12 Martin Peres 2019-11-19 09:15:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue on our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/712.