When running The Bard's Tale IV, turning around at the beginning of the game consistently causes a GPU hang. I see this in dmesg:

[ 4246.501534] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 4251.365674] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178390, emitted seq=178392
[ 4251.365740] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 7251 thread BardsTale4:cs0 pid 7292
[ 4251.365742] [drm] GPU recovery disabled.

GPU: Sapphire Pulse RX 5700 XT
Kernel: 5.3.0-rc8+
OpenGL renderer string: AMD NAVI10 (DRM 3.33.0, 5.3.0-rc8+, LLVM 10.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.3.0-devel (git-87fa8d9ebc)
Game version: GOG, release 1.0.0 (version 4.20.1 / 32050).
An apitrace of the problem would be helpful if you can get it.
I'll try to make a trace. The error message looks like this one: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=v5.3-rc8#n5703
Is there any way to delay the start of tracing, to avoid a massive trace file?
Uploaded the trace here (should be valid for 30 days): https://ufile.io/kvf9t1eu Sorry for the huge size; there is an unskippable cutscene at the beginning. It is compressed with pixz, so it can be decompressed using all CPU cores (regular single-threaded xz can decompress it as well).
(In reply to Shmerl from comment #4)
> Uploaded the trace here (should be valid for 30 days):
> https://ufile.io/kvf9t1eu
>
> Sorry for huge size, there is an unskippable cutscene in the beginning.
>
> Compressed with pixz, so should be decompressible using all CPU cores
> (compatible with regular single threaded decompressing xz as well).

The attached trace doesn't cause a GPU hang here. Does it hang on your machine?
(In reply to Pierre-Eric Pelloux-Prayer from comment #5)
> The attached trace doesn't cause a GPU hang here.
>
> Does it hang on your machine?

I recorded it until the freeze happened, and then had to use Alt+SysRq+REISUB to reboot. That's the resulting file. I'll try replaying the trace to see what happens.
Just replayed the trace: it ended before the buggy part. Something must have interrupted it, or maybe it has a size cap? I'll try making it again.
Here is a new trace: https://uploadfiles.io/9uykx7nh Now it's catching the hang moment. Replaying it doesn't hang the GPU though, just produces some errors in the trace output.
Thanks! I can reproduce the problem using the new trace. It's strange: the problem is caused by some shaders failing to link, but the error message doesn't match what the shaders actually do. Dumping out the shaders and compiling them with our shader-db tool also results in them compiling correctly. There is clearly a bug in here somewhere, but it will take some more digging to find it.
OK, apitrace was pointing me to the wrong shaders. I managed to find the correct ones and can confirm this is a bug in the game itself. I have reported the problem to the developers; let's see if they reply. For completeness, here is the body of the bug report:

"The games shaders use GLSL 4.30 which mean interpolation qualifiers must match across shader interfaces otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed. There is at least one attempt in the game (maybe more?) to link a vertex shader output that sets the noperspective qualifier on an output to a fragment shader input where no interpolation qualifier is set. This results in hangs and stuttering in the game when it attempts to use the program that failed to link. I've attached the problem shaders in a text file."
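The mismatch described in the report can be sketched in a few lines of Python. This is only an illustration of the linking rule, not Mesa's implementation: the function name and the `mismatch_allowed` flag are made up for this example, and whether a given GLSL version permits the mismatch is passed in as a flag because that cutoff is discussed further down the thread. Treating an omitted qualifier as `smooth` when comparing is a simplifying assumption.

```python
# Illustrative model of cross-stage interpolation matching (not Mesa code).
DEFAULT_INTERPOLATION = "smooth"  # GLSL default when no qualifier is written


def resolve_interpolation(vs_out_qualifier, fs_in_qualifier, mismatch_allowed):
    """Return the interpolation mode used for a varying, or raise on a link error.

    Qualifiers are "flat", "noperspective", "smooth", or None when omitted.
    mismatch_allowed is a hypothetical flag standing in for "the GLSL version
    in use (or a driver override) permits differing qualifiers".
    """
    # Assumption: an omitted qualifier compares equal to the default.
    vs = vs_out_qualifier or DEFAULT_INTERPOLATION
    fs = fs_in_qualifier or DEFAULT_INTERPOLATION
    if vs != fs and not mismatch_allowed:
        raise ValueError(
            f"link error: output is '{vs}' but input is '{fs}'"
        )
    # When mismatches are allowed, the fragment shader's declaration wins.
    return fs
```

With the game's shaders (a `noperspective` vertex output feeding an unqualified fragment input), the strict rule raises a link error, while the relaxed rule yields `smooth`, since only the fragment shader's declaration matters.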
Since the game uses Unreal Engine, I wonder whether the developers control the shaders directly, or whether they are produced by the UE toolchain, which transpiles them from something else. In other words, it could be an upstream UE bug. Just for the record, the game reports that it is using Unreal Engine 4.20.1-150741.
For now you could try using the environment variable: allow_glsl_cross_stage_interpolation_mismatch=true
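As a minimal sketch of applying this workaround from a launcher script, the snippet below builds an environment with the variable set and starts the game with it. Only the variable name comes from the comment above; the helper names and the launcher command are hypothetical and need to be adapted to the actual install.

```python
import os
import subprocess

# Option name taken from the comment above; everything else is illustrative.
OPTION = "allow_glsl_cross_stage_interpolation_mismatch"


def env_with_override(base_env=None):
    """Return a copy of the environment with the override enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env[OPTION] = "true"
    return env


def launch(game_cmd):
    """Start the game with the override set.

    game_cmd is a hypothetical command list, e.g. ["./start.sh"].
    """
    return subprocess.Popen(game_cmd, env=env_with_override())
```

Calling `launch([...]).wait()` with the real launcher command runs the game with the override in effect for that process only, leaving the rest of the session untouched.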
(In reply to Timothy Arceri from comment #12)
> For now you could try using the environment variable:
>
> allow_glsl_cross_stage_interpolation_mismatch=true

Thanks! I tried setting it, and it shows the message that it's overridden, but the game still hangs.
(In reply to Shmerl from comment #13)
> (In reply to Timothy Arceri from comment #12)
> > For now you could try using the environment variable:
> >
> > allow_glsl_cross_stage_interpolation_mismatch=true
>
> Thanks! I tried setting it, and it shows the message that it's overridden,
> but the game still hangs.

Are you sure it is hanging? There is a huge amount of stuttering due to the game compiling shaders in-game. It's really bad the first time I run the apitrace but much better the second time.
(In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the
> game compiling shaders in-game. It's really bad the first time I run the
> apitrace but much better the second time.

I couldn't even switch to a tty using Ctrl+Alt+F1, so I didn't check dmesg and just rebooted with SysRq. If this happens again with the override, maybe I can log in remotely over ssh to check whether it's different from before.
"The games shaders use GLSL 4.30 which mean interpolation qualifiers must match across shader interfaces otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed."

I believe that relaxation came in version 4.30, not 4.40. The 4.30 spec is here: https://www.khronos.org/registry/OpenGL/specs/gl/GLSLangSpec.4.30.pdf

From the "4.3.4 Input Variables" section:

"The fragment shader inputs form an interface with the last active shader in the vertex processing pipeline. For this interface, the last active shader stage output variables and fragment shader input variables of the same name must match in type and qualification, with a few exceptions: The storage qualifiers must, of course, differ (one is in and one is out). Also, interpolation qualification (e.g., flat) and auxiliary qualification (e.g. centroid) may differ. These mismatches are allowed between any pair of stages. When interpolation or auxiliary qualifiers do not match, those provided in the fragment shader supersede those provided in previous stages. If any such qualifiers are completely missing in the fragment shaders, then the default is used, rather than any qualifiers that may have been declared in previous stages. That is, what matters is what is declared in the fragment shaders, not what is declared in shaders in previous stages."

That language is identical between 4.30 and 4.40. It sounds like it explicitly allows interpolation qualifiers to differ. However, the 4.20 spec language in that section was quite different and did require an interpolation qualifier match.

Also, from https://www.khronos.org/opengl/wiki/Shader_Compilation#Interface_matching:

"If GLSL 4.30 or later is available, then the interpolation qualifiers (including centroid and sample) do not need to match."
(In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the
> game compiling shaders in-game. It's really bad the first time I run the
> apitrace but much better the second time.

It is a hang. Even with allow_glsl_cross_stage_interpolation_mismatch=true it gets stuck permanently. I was able to log into the system over ssh when that happened, and this was shown in dmesg:

[ 149.642857] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 154.762918] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=20378, emitted seq=20380
[ 154.762984] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 2563 thread BardsTale4:cs0 pid 2597
[ 154.762986] [drm] GPU recovery disabled.
[ 363.660017] INFO: task BardsTale4-Linu:2563 blocked for more than 120 seconds.
[ 363.660021] Tainted: G E 5.3.0-rc8+ #14
[ 363.660022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.660023] BardsTale4-Linu D 0 2563 2556 0x80004002
[ 363.660026] Call Trace:
[ 363.660033] ? __schedule+0x2b9/0x6c0
[ 363.660035] schedule+0x39/0xa0
[ 363.660037] schedule_timeout+0x20f/0x300
[ 363.660040] dma_fence_default_wait+0x1c2/0x2a0
[ 363.660042] ? dma_fence_free+0x20/0x20
[ 363.660044] dma_fence_wait_timeout+0xdd/0xf0
[ 363.660106] gmc_v10_0_flush_gpu_tlb+0x159/0x1a0 [amdgpu]
[ 363.660157] amdgpu_gart_unbind+0x89/0xb0 [amdgpu]
[ 363.660206] amdgpu_ttm_backend_unbind+0x3c/0xe0 [amdgpu]
[ 363.660211] ttm_tt_unbind+0x1d/0x30 [ttm]
[ 363.660215] ttm_tt_destroy.part.0+0xe/0x50 [ttm]
[ 363.660219] ttm_bo_cleanup_memtype_use+0x2e/0x70 [ttm]
[ 363.660222] ttm_bo_put+0x24e/0x2a0 [ttm]
[ 363.660269] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ 363.660317] amdgpu_gem_object_free+0x2e/0x50 [amdgpu]
[ 363.660328] drm_gem_object_release_handle+0x5a/0xc0 [drm]
[ 363.660339] ? drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
[ 363.660341] idr_for_each+0x5e/0xd0
[ 363.660344] ? __inode_wait_for_writeback+0x7e/0xf0
[ 363.660354] drm_gem_release+0x1c/0x30 [drm]
[ 363.660363] drm_file_free.part.0+0x2ab/0x300 [drm]
[ 363.660373] drm_release+0x4b/0x80 [drm]
[ 363.660375] __fput+0xb9/0x250
[ 363.660378] task_work_run+0x8a/0xb0
[ 363.660381] do_exit+0x2f5/0xb60
[ 363.660383] do_group_exit+0x3a/0xa0
[ 363.660385] get_signal+0x15b/0x890
[ 363.660387] do_signal+0x30/0x690
[ 363.660390] ? _copy_from_user+0x37/0x60
[ 363.660393] exit_to_usermode_loop+0x91/0xf0
[ 363.660394] do_syscall_64+0x100/0x110
[ 363.660396] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 363.660398] RIP: 0033:0x4540f22
[ 363.660403] Code: Bad RIP value.
[ 363.660404] RSP: 002b:00007fff54bf6c30 EFLAGS: 00210202
[ 363.660406] RAX: 00007fff54bf6c30 RBX: 0000000000000001 RCX: 00000000939f4000
[ 363.660406] RDX: 00007fff54bf6c88 RSI: 00007fff54bf6c98 RDI: 00007fff54bf6c80
[ 363.660407] RBP: 00007fa81869c430 R08: 000000000000021f R09: 000000000936d890
[ 363.660408] R10: 0000000000000001 R11: 0000000000200206 R12: 00007fff54bf6d90
[ 363.660408] R13: 0000000000000008 R14: 000000000768bdd8 R15: 00007fff54bf6ce0

Maybe the trace alone isn't enough to reproduce it? Did you try the actual game?
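For anyone collecting similar logs over ssh, a small Python helper can pull the ring timeout details out of dmesg output so that hangs can be compared between runs. This is a sketch that assumes the message format shown in the log above; the function name is made up for this example, and other kernel versions may word the message differently.

```python
import re

# Matches the amdgpu ring timeout message as it appears in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(dmesg_text):
    """Return one dict per ring timeout message found in dmesg_text."""
    results = []
    for match in RING_TIMEOUT.finditer(dmesg_text):
        results.append(
            {
                "ring": match["ring"],
                "signaled": int(match["signaled"]),
                "emitted": int(match["emitted"]),
            }
        )
    return results
```

Feeding it the saved output of `dmesg` gives the ring name and the signaled and emitted sequence numbers for each timeout.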
Just for reference, I'm using firmware from here: https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1427.