Bug 111591 - [radeonsi/Navi] The Bard's Tale IV causes a GPU hang
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2019-09-09 01:47 UTC by Shmerl
Modified: 2019-09-25 18:50 UTC
Description Shmerl 2019-09-09 01:47:31 UTC
When running The Bard's Tale IV, turning around at the beginning of the game consistently causes a GPU hang. I see this in dmesg:

[ 4246.501534] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 4251.365674] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178390, emitted seq=178392
[ 4251.365740] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 7251 thread BardsTale4:cs0 pid 7292
[ 4251.365742] [drm] GPU recovery disabled.

GPU: Sapphire Pulse RX 5700 XT
Kernel: 5.3.0-rc8+
OpenGL renderer string: AMD NAVI10 (DRM 3.33.0, 5.3.0-rc8+, LLVM 10.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.3.0-devel (git-87fa8d9ebc)
Game version: GOG, release 1.0.0 (version 4.20.1 / 32050).
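
The amdgpu_job_timedout line above names the ring that timed out and the last fence sequence numbers that were signaled and emitted. A minimal Python sketch, assuming the dmesg line format shown in this report (the function name parse_ring_timeouts and the "pending" field are made up for illustration), extracts those fields so logs from different runs can be compared:

```python
import re

# Matches lines such as:
# [ 4251.365674] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178390, emitted seq=178392
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per ring-timeout line found in dmesg_text."""
    results = []
    for line in dmesg_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue  # not a ring-timeout line (e.g. "Process information")
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "time": float(match.group("time")),
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Illustrative field: fences emitted but not yet signaled.
            "pending": emitted - signaled,
        })
    return results
```

For the log in this report it reports ring gfx_0.0.0 with a difference of 2 between the emitted and signaled sequence numbers.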
Comment 1 Timothy Arceri 2019-09-09 05:24:45 UTC
An apitrace of the problem would be helpful if you can get it.
Comment 2 Shmerl 2019-09-09 23:11:15 UTC
I'll try to make a trace. The error message looks like this one:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=v5.3-rc8#n5703
Comment 3 Shmerl 2019-09-10 00:49:46 UTC
Is there any way to delay the start of tracing, to avoid a massive trace file?
Comment 4 Shmerl 2019-09-10 01:14:31 UTC
Uploaded the trace here (should be valid for 30 days): https://ufile.io/kvf9t1eu

Sorry for huge size, there is an unskippable cutscene in the beginning.

Compressed with pixz, so it can be decompressed using all CPU cores (it is also compatible with regular single-threaded xz decompression).
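
Since the file is described above as compatible with regular xz decompression, it should also be readable with Python's standard lzma module. The following is a minimal single-threaded sketch under that assumption; the function name decompress_xz and the file names in the usage note are hypothetical:

```python
import lzma
import shutil


def decompress_xz(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress an .xz file to dst_path.

    Data is copied in chunks, so a very large trace never has to fit in memory.
    """
    with lzma.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A call would look like decompress_xz("game.trace.xz", "game.trace"). Unlike pixz, this uses a single core.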
Comment 5 Pierre-Eric Pelloux-Prayer 2019-09-10 15:27:10 UTC
(In reply to Shmerl from comment #4)
> Uploaded the trace here (should be valid for 30 days):
> https://ufile.io/kvf9t1eu
> 
> Sorry for huge size, there is an unskippable cutscene in the beginning.
> 
> Compressed with pixz, so should be decompressible using all CPU cores
> (compatible with regular single threaded decompressing xz as well).

The attached trace doesn't cause a GPU hang here.

Does it hang on your machine?
Comment 6 Shmerl 2019-09-10 15:31:37 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #5)
>
> 
> The attached trace doesn't cause a GPU hang here.
> 
> Does it hang on your machine?

I recorded it until the freeze happened, and then had to do Alt+SysRq+REISUB to reboot. So that's the resulting file. I'll try replaying the trace to see what happens.
Comment 7 Shmerl 2019-09-10 22:36:40 UTC
Just replayed the trace - it ended before the buggy part. Something must have interrupted it, or maybe it has a size cap? I'll try recording it again.
Comment 8 Shmerl 2019-09-10 23:13:48 UTC
Here is a new trace: https://uploadfiles.io/9uykx7nh

This one captures the moment of the hang. Replaying it doesn't hang the GPU, though; it just produces some errors in the trace output.
Comment 9 Timothy Arceri 2019-09-11 01:38:10 UTC
Thanks! I can reproduce the problem using the new trace.

It's strange: the problem is caused by some shaders failing to link, but the error message doesn't match what the shaders actually do. Dumping the shaders and compiling them with our shader-db tool also results in them compiling correctly. There is clearly a bug in here somewhere, but it will take some more digging to find it.
Comment 10 Timothy Arceri 2019-09-11 03:12:54 UTC
OK, apitrace was pointing me to the incorrect shaders. I managed to find the correct ones and can confirm this is a bug in the game itself. I have reported the problem to the developers; let's see if they reply.

For completeness here is the body of the bug report:

"The game's shaders use GLSL 4.30, which means interpolation qualifiers must match across shader interfaces; otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed.

There is at least one attempt in the game (maybe more?) to link a vertex shader output that sets the noperspective qualifier on an output to a fragment shader input where no interpolation qualifier is set. This results in hangs and stuttering in the game when it attempts to use the program that failed to link. 

I've attached the problem shaders in a text file."
Comment 11 Shmerl 2019-09-11 03:19:45 UTC
Since the game is using Unreal Engine, I wonder whether the developers control the shaders directly, or whether they are produced by the UE toolchain, which transpiles them from something else. In other words, it could be an upstream UE bug.

Just for the record, the game shows it's using Unreal Engine 4.20.1-150741.
Comment 12 Timothy Arceri 2019-09-11 03:35:42 UTC
For now you could try using the environment variable:

allow_glsl_cross_stage_interpolation_mismatch=true
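
One way to apply this override to the game process only, rather than to the whole session, is to set it in the environment passed to the launcher. A minimal Python sketch; only the variable name comes from this comment, while build_env, launch and the launcher path in the usage note are hypothetical:

```python
import os
import subprocess

# Variable name as given in the comment above.
OPTION = "allow_glsl_cross_stage_interpolation_mismatch"


def build_env(base_env=None):
    """Return a copy of the environment with the override set to "true"."""
    env = dict(os.environ if base_env is None else base_env)
    env[OPTION] = "true"
    return env


def launch(game_cmd):
    """Start the game with the override; game_cmd is a hypothetical command list."""
    return subprocess.Popen(game_cmd, env=build_env())
```

For example, launch(["./start.sh"]) would start a launcher script with the variable set, leaving the parent environment untouched.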
Comment 13 Shmerl 2019-09-11 03:42:55 UTC
(In reply to Timothy Arceri from comment #12)
> For now you could try using the environment variable:
> 
> allow_glsl_cross_stage_interpolation_mismatch=true

Thanks! I tried setting it, and it shows the message that it's overridden, but the game still hangs.
Comment 14 Timothy Arceri 2019-09-11 03:49:09 UTC
(In reply to Shmerl from comment #13)
> (In reply to Timothy Arceri from comment #12)
> > For now you could try using the environment variable:
> > 
> > allow_glsl_cross_stage_interpolation_mismatch=true
> 
> Thanks! I tried setting it, and it shows the message that it's overridden,
> but the game still hangs.

Are you sure it is hanging? There is a huge amount of stuttering due to the game compiling shaders in-game. It's really bad the first time I run the apitrace but much better the second time.
Comment 15 Shmerl 2019-09-11 04:20:28 UTC
(In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the
> game compiling shaders in-game. It's really bad the first time I run the
> apitrace but much better the second time.


I couldn't even switch to a tty using Ctrl+Alt+F1, so I didn't check dmesg and just rebooted with SysRq. If this happens again with the override, maybe I can try accessing the machine remotely over ssh to check whether the output differs from before.
Comment 16 vggl 2019-09-11 17:44:14 UTC
"The game's shaders use GLSL 4.30, which means interpolation qualifiers must match across shader interfaces; otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed."

I believe that relaxation came in version 4.30, not 4.40.

The 4.30 spec here: https://www.khronos.org/registry/OpenGL/specs/gl/GLSLangSpec.4.30.pdf

From the "4.3.4 Input Variables" section:

"The fragment shader inputs form an interface with the last active shader in the vertex processing pipeline. For this interface, the last active shader stage output variables and fragment shader input variables of the same name must match in type and qualification, with a few exceptions: The storage qualifiers must, of course, differ (one is in and one is out). Also, interpolation qualification (e.g., flat) and auxiliary qualification (e.g. centroid) may differ. These mismatches are allowed between any pair of stages. When interpolation or auxiliary qualifiers do not match, those provided in the fragment shader supersede those provided in previous stages. If any such qualifiers are completely missing in the fragment shaders, then the default is used, rather than any qualifiers that may have been declared in previous stages. That is, what matters is what is declared in the fragment shaders, not what is declared in shaders in previous stages."

That language is identical between 4.30 and 4.40. It sounds like it explicitly allows interpolation qualifiers to differ. However, the 4.20 spec language in that section was quite different and did require an interpolation qualifier match.

Also, from https://www.khronos.org/opengl/wiki/Shader_Compilation#Interface_matching:

"If GLSL 4.30 or later is available, then the interpolation qualifiers (including centroid and sample) do not need to match."
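
To make the quoted rule concrete, here is a small Python model of the two behaviours discussed in this thread: a strict check that rejects mismatched qualifiers, and the 4.30 wording where the fragment shader declaration decides. It is only a toy model, not Mesa's linker; the function names are made up, and comparing the declared qualifiers literally is a simplifying assumption.

```python
# GLSL uses smooth interpolation when no qualifier is written.
DEFAULT_INTERPOLATION = "smooth"


def effective_interpolation(fragment_qualifier):
    """Qualifier that applies under the quoted 4.30 text.

    Only the fragment shader declaration matters; None means "not declared",
    in which case the default is used.
    """
    if fragment_qualifier is None:
        return DEFAULT_INTERPOLATION
    return fragment_qualifier


def link_varying(prev_stage_qualifier, fragment_qualifier, allow_mismatch):
    """Model linking one varying between the previous stage and the fragment stage.

    With allow_mismatch=False a difference in declared qualifiers is treated
    as a link error (assumption: declarations are compared literally).
    """
    if prev_stage_qualifier != fragment_qualifier and not allow_mismatch:
        raise ValueError(
            f"interpolation mismatch: {prev_stage_qualifier!r} vs {fragment_qualifier!r}"
        )
    return effective_interpolation(fragment_qualifier)
```

For the case reported in comment 10 (noperspective on the vertex output, nothing on the fragment input), link_varying("noperspective", None, allow_mismatch=True) yields "smooth", while allow_mismatch=False raises the link error.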
Comment 17 Shmerl 2019-09-11 23:30:22 UTC
(In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the
> game compiling shaders in-game. It's really bad the first time I run the
> apitrace but much better the second time.

It is a hang. Even with allow_glsl_cross_stage_interpolation_mismatch=true it gets stuck permanently. I was able to log into the system over ssh when that happened, and this was shown in dmesg:

[  149.642857] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  154.762918] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=20378, emitted seq=20380
[  154.762984] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 2563 thread BardsTale4:cs0 pid 2597
[  154.762986] [drm] GPU recovery disabled.
[  363.660017] INFO: task BardsTale4-Linu:2563 blocked for more than 120 seconds.
[  363.660021]       Tainted: G            E     5.3.0-rc8+ #14
[  363.660022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.660023] BardsTale4-Linu D    0  2563   2556 0x80004002
[  363.660026] Call Trace:
[  363.660033]  ? __schedule+0x2b9/0x6c0
[  363.660035]  schedule+0x39/0xa0
[  363.660037]  schedule_timeout+0x20f/0x300
[  363.660040]  dma_fence_default_wait+0x1c2/0x2a0
[  363.660042]  ? dma_fence_free+0x20/0x20
[  363.660044]  dma_fence_wait_timeout+0xdd/0xf0
[  363.660106]  gmc_v10_0_flush_gpu_tlb+0x159/0x1a0 [amdgpu]
[  363.660157]  amdgpu_gart_unbind+0x89/0xb0 [amdgpu]
[  363.660206]  amdgpu_ttm_backend_unbind+0x3c/0xe0 [amdgpu]
[  363.660211]  ttm_tt_unbind+0x1d/0x30 [ttm]
[  363.660215]  ttm_tt_destroy.part.0+0xe/0x50 [ttm]
[  363.660219]  ttm_bo_cleanup_memtype_use+0x2e/0x70 [ttm]
[  363.660222]  ttm_bo_put+0x24e/0x2a0 [ttm]
[  363.660269]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  363.660317]  amdgpu_gem_object_free+0x2e/0x50 [amdgpu]
[  363.660328]  drm_gem_object_release_handle+0x5a/0xc0 [drm]
[  363.660339]  ? drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
[  363.660341]  idr_for_each+0x5e/0xd0
[  363.660344]  ? __inode_wait_for_writeback+0x7e/0xf0
[  363.660354]  drm_gem_release+0x1c/0x30 [drm]
[  363.660363]  drm_file_free.part.0+0x2ab/0x300 [drm]
[  363.660373]  drm_release+0x4b/0x80 [drm]
[  363.660375]  __fput+0xb9/0x250
[  363.660378]  task_work_run+0x8a/0xb0
[  363.660381]  do_exit+0x2f5/0xb60
[  363.660383]  do_group_exit+0x3a/0xa0
[  363.660385]  get_signal+0x15b/0x890
[  363.660387]  do_signal+0x30/0x690
[  363.660390]  ? _copy_from_user+0x37/0x60
[  363.660393]  exit_to_usermode_loop+0x91/0xf0
[  363.660394]  do_syscall_64+0x100/0x110
[  363.660396]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  363.660398] RIP: 0033:0x4540f22
[  363.660403] Code: Bad RIP value.
[  363.660404] RSP: 002b:00007fff54bf6c30 EFLAGS: 00210202
[  363.660406] RAX: 00007fff54bf6c30 RBX: 0000000000000001 RCX: 00000000939f4000
[  363.660406] RDX: 00007fff54bf6c88 RSI: 00007fff54bf6c98 RDI: 00007fff54bf6c80
[  363.660407] RBP: 00007fa81869c430 R08: 000000000000021f R09: 000000000936d890
[  363.660408] R10: 0000000000000001 R11: 0000000000200206 R12: 00007fff54bf6d90
[  363.660408] R13: 0000000000000008 R14: 000000000768bdd8 R15: 00007fff54bf6ce0

Maybe the trace alone isn't enough to reproduce it? Did you try the actual game?
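
For longer logs, the call trace above can be reduced to a list of frames to see where the task is blocked. A minimal Python sketch, assuming the frame format shown in this dmesg output (parse_call_trace is a hypothetical helper name); frames prefixed with "?" are the ones the kernel's unwinder marks as unreliable:

```python
import re

# Matches frames such as:
# [  363.660106]  gmc_v10_0_flush_gpu_tlb+0x159/0x1a0 [amdgpu]
# [  363.660033]  ? __schedule+0x2b9/0x6c0
FRAME = re.compile(
    r"\]\s+(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_call_trace(dmesg_text):
    """Return (symbol, module, reliable) tuples for every stack frame line.

    module is None for frames in the core kernel image.
    """
    frames = []
    for line in dmesg_text.splitlines():
        match = FRAME.search(line)
        if match is None:
            continue  # headers, register dumps, and other non-frame lines
        frames.append((
            match.group("symbol"),
            match.group("module"),
            match.group("unreliable") is None,
        ))
    return frames
```

Applied to the trace in this comment, the first amdgpu frame is gmc_v10_0_flush_gpu_tlb, reached through dma_fence_wait_timeout, which suggests the exiting process is stuck waiting on a fence during a TLB flush.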
Comment 18 Shmerl 2019-09-12 00:52:41 UTC
Just for reference, I'm using firmware from here: https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/
Comment 19 GitLab Migration User 2019-09-25 18:50:44 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1427.