Created attachment 137529
When playing The Talos Principle at 3840x2160 with "Max 3D Rendering MPIX" set to 8.4 (4K 2160), the game frequently hangs the system. If I lower the setting to 1.8 (1680x1050), the game does not seem to hang.
This has happened for as long as I can remember; I just have not filed a bug until now.
I can ssh to the system after this happens, but I cannot reboot the system.
I am using a Vega 64, Linux 4.16-rc2 and Mesa 18.0.0-rc4. Please let me know if I can provide any logs that could help.
Attaching dmesg after the hang has happened.
Curious, I haven't noticed that, and your setup sounds really similar to my benchmarking setup.
Not being able to reboot is a generic amdgpu problem after a hang.
You say it has been happening for as long as you can remember; how long are we talking about? Which kernels and Mesa versions have you used where you had the issue?
What could be useful is starting Talos with the RADV_TRACE_FILE and RADV_DEBUG environment variables set.
You can do this, for example, by right-clicking the game name in the list -> Properties -> Set Launch Options ->
RADV_TRACE_FILE=/some/file/name RADV_DEBUG=allbos,syncshaders %command%
Given the nature of the issue, it is possible we crash after the hang while trying to produce the trace file. In that case a stack trace would be useful to narrow down which packet we define illegally (though it may need debug symbols to be useful).
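For anyone who wants to try the same variables outside of Steam, here is a minimal sketch of what the launch-options line amounts to. Steam replaces %command% with the game's own command line, so the two variables end up in that process's environment. The function name `launch_with_radv_trace` is hypothetical and only for illustration; the variable names and values are the ones given above.

```shell
# Hypothetical helper: run any command with the RADV variables from the
# launch-options line set in its environment.
# Usage: launch_with_radv_trace TRACE_FILE COMMAND [ARGS...]
launch_with_radv_trace() {
    trace_file="$1"
    shift
    # Prefix assignments apply only to the command being started,
    # just like VAR=value %command% in the Steam launch options.
    RADV_TRACE_FILE="$trace_file" RADV_DEBUG=allbos,syncshaders "$@"
}
```

To check that the variables actually reach the child process, something like `launch_with_radv_trace /tmp/radv.trace env | grep RADV` should list both of them.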
I'm not sure for how long this has been happening. Definitely for as long as I have had my Vega, so at least since the middle of October.
I almost always run the latest rc kernels and used to use Mesa from git but have been using 17.3 and now 18.0-rcs lately.
When I add RADV_TRACE_FILE=/home/erik/temp/tracefile RADV_DEBUG=allbos,syncshaders %command%, starting the game freezes the desktop. The screen turns black as the game starts but nothing more happens.
I can ssh in and see this in dmesg:
[ 142.816155] [drm:amdgpu_job_timedout] *ERROR* ring comp_1.1.0 timeout, last signaled seq=2, last emitted seq=3
[ 142.816162] [drm] No hardware hang detected. Did some blocks stall?
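In case it helps others collect the same information over ssh, here is a rough sketch for pulling the ring timeout reports out of the kernel log. The pattern is based only on the message format shown in the two lines above, so it is an assumption that other kernels word the message the same way; the function names are made up for this example.

```shell
# Read kernel log text on stdin and keep only amdgpu ring timeout reports.
amdgpu_timeouts() {
    grep -E 'amdgpu_job_timedout.*ring [^ ]+ timeout'
}

# Reduce those reports to the unique names of the rings that timed out.
timed_out_rings() {
    amdgpu_timeouts | sed -E 's/.*ring ([^ ]+) timeout.*/\1/' | sort -u
}
```

Running `dmesg | timed_out_rings` after a hang (dmesg may need root on some systems) would print comp_1.1.0 for the log above.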
I don't know if the tracefile is useful in this case, but I'm attaching it to the bug.
Created attachment 137546
I should note that I have not played Talos very often (since it hangs my computer), so it might have worked fine at times. I just tried it now and then, and when I got a hang I gave up for a month or two before trying again with a new kernel or a more recent Mesa. I do feel this was happening with my earlier card, a Fury X, but I'm not sure.
I don't seem to be able to get a backtrace. When I try to attach to the Talos process after a hang, gdb just sits there trying to attach to the process. I'm not a developer, so I don't really know what else to try.
I'm attaching the output from steam as there seems to be some more information in there.
Created attachment 137591
Managed to get a core dump finally. I am now on LLVM 6.0.0-rc3 and Mesa 18.0.0-rc4 and the hang still happens.
Attaching a new tracefile and a backtrace from the same crash. I compiled LLVM and Mesa with -ggdb. Please let me know if something more needs to have debugging symbols and I'll get a new backtrace.
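For reference, a backtrace can be extracted from a core dump roughly as sketched below, assuming gdb is installed. The helper name `backtrace_from_core` and the default output file name are hypothetical, and the binary and core paths are placeholders to be filled in; this is not necessarily how the attached backtrace was produced.

```shell
# Hypothetical helper: dump a full backtrace of every thread from a core file.
# Usage: backtrace_from_core BINARY CORE [OUTFILE]
backtrace_from_core() {
    if [ "$#" -lt 2 ]; then
        echo "usage: backtrace_from_core BINARY CORE [OUTFILE]" >&2
        return 2
    fi
    binary="$1"
    core="$2"
    out="${3:-backtrace.txt}"   # default output name is an assumption
    # -batch makes gdb exit after running the commands given with -ex.
    gdb -batch -ex 'thread apply all bt full' "$binary" "$core" > "$out" 2>&1
}
```

Frames inside Mesa and LLVM only show useful names and values if those libraries were built with debug symbols, as with the -ggdb builds mentioned above.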
Created attachment 137624
Created attachment 137625
Created attachment 137626
I saw that a lot of the values were optimised out, so I compiled Mesa with -O0 and got a new set of files.
Sorry for all the noise.
Created attachment 137627
Are you still able to reproduce this? FWIW, I launch Talos with different settings almost every week and I never got any GPU hangs.
I just played a little bit, and a spot where it always hung for me now works fine.
I don't think I had tried the game between my last update to this bug and today.