Bug 112266 - [Navi] Pathfinder: Kingmaker is causing a GPU hang: flip_done timed out error
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set normal
Assignee: Default DRI bug account
Reported: 2019-11-14 01:54 UTC by Shmerl
Modified: 2019-11-19 10:01 UTC

Attachments
possible fix (1.73 KB, patch)
2019-11-15 16:04 UTC, Alex Deucher

Description Shmerl 2019-11-14 01:54:28 UTC
Running Pathfinder: Kingmaker (latest GOG release, which should be the same as the latest Steam one) on a Sapphire Pulse RX 5700 XT causes a GPU hang with a flip_done timed out error (see below for the detailed log). It doesn't look like the common shader hangs with a ring gfx_0.0.0 timeout or the common sdma hangs.

The game uses OpenGL. I run it on Debian testing with this configuration:

kernel: 5.4-rc7
radeonsi: Mesa-master / llvm10:

OpenGL renderer string: AMD NAVI10 (DRM 3.35.0, 5.4.0-rc7, LLVM 10.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 20.0.0-devel (git-eb6352162d)

llvm: 10~+201911120943210600592dd459242
from this llvm10 snapshot: https://tracker.debian.org/news/1079513/accepted-llvm-toolchain-snapshot-110201911120943210600592dd459242-1exp1-source-into-experimental/


DE: KDE Plasma 5.14.5 (X session).
GPU: Sapphire Pulse RX 5700 XT
Monitor: LG 27GL85-B (2560x1440, 144 Hz, DisplayPort 1.4 connection, adaptive sync activated in Xorg configuration).

I launch the game with AMD_DEBUG=nodma,nongg
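
As a minimal sketch of that launch setup, assuming a small Python wrapper is used to start the game: the helper names and the game path argument are hypothetical and not part of this report; only the AMD_DEBUG value comes from the line above.

```python
import os
import subprocess


def build_env(base=None):
    """Return a copy of the environment with AMD_DEBUG=nodma,nongg set."""
    env = dict(os.environ if base is None else base)
    env["AMD_DEBUG"] = "nodma,nongg"  # value quoted from the report
    return env


def launch(game_path):
    """Start the game with the modified environment.

    game_path is a placeholder for wherever the game binary lives.
    """
    return subprocess.Popen([game_path], env=build_env())
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the game process.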

Recording an apitrace doesn't help, since replaying it does not reproduce the hang. So could it be an amdgpu issue? Please let me know what additional info would help you narrow it down. The hang itself is quite reproducible, and you can try it yourself with Pathfinder: Kingmaker.

The hang produces this in dmesg:

[  659.445501] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[  669.685601] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:55:plane-5] flip_done timed out
[  669.685644] ------------[ cut here ]------------
[  669.685729] WARNING: CPU: 6 PID: 1018 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5851 amdgpu_dm_atomic_commit_tail+0x1c56/0x1d70 [amdgpu]
[  669.685730] Modules linked in: rfcomm(E) nf_tables(E) nfnetlink(E) bnep(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) btusb(E) btrtl(E) snd_hda_codec_realtek(E) btbcm(E) crc32_pclmul(E) btintel(E) iwlmvm(E) snd_hda_codec_generic(E) bluetooth(E) ghash_clmulni_intel(E) ledtrig_audio(E) mac80211(E) libarc4(E) snd_hda_codec_hdmi(E) uvcvideo(E) snd_hda_intel(E) videobuf2_vmalloc(E) snd_usb_audio(E) snd_intel_nhlt(E) videobuf2_memops(E) drbg(E) snd_hda_codec(E) videobuf2_v4l2(E) snd_usbmidi_lib(E) iwlwifi(E) nls_ascii(E) snd_hda_core(E) snd_rawmidi(E) videobuf2_common(E) snd_seq_device(E) snd_hwdep(E) efi_pstore(E) nls_cp437(E) ansi_cprng(E) snd_pcm(E) videodev(E) sp5100_tco(E) aesni_intel(E) cfg80211(E) vfat(E) ecdh_generic(E) crypto_simd(E) ecc(E) snd_timer(E) fat(E) ccp(E) snd(E) cryptd(E) mc(E) glue_helper(E) crc16(E) wmi_bmof(E) pcspkr(E) efivars(E) k10temp(E) watchdog(E) sg(E) rfkill(E) soundcore(E) rng_core(E) evdev(E) acpi_cpufreq(E) nct6775(E) hwmon_vid(E)
[  669.685753]  parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) amdgpu(E) gpu_sched(E) mxm_wmi(E) ahci(E) ttm(E) libahci(E) drm_kms_helper(E) xhci_pci(E) crc32c_intel(E) xhci_hcd(E) i2c_piix4(E) libata(E) drm(E) igb(E) dca(E) mfd_core(E) ptp(E) scsi_mod(E) usbcore(E) pps_core(E) i2c_algo_bit(E) nvme(E) nvme_core(E) wmi(E) button(E)
[  669.685770] CPU: 6 PID: 1018 Comm: Xorg Tainted: G            E     5.4.0-rc7 #31
[  669.685771] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.50 11/02/2019
[  669.685846] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x1c56/0x1d70 [amdgpu]
[  669.685847] Code: 67 fb ff ff 41 8b 4c 24 60 48 c7 c2 60 d6 a2 c0 bf 02 00 00 00 48 c7 c6 80 f8 a9 c0 e8 e3 7d bb ff 49 8b 47 08 e9 31 e5 ff ff <0f> 0b e9 b4 ec ff ff 0f 0b 0f 0b e9 cb ec ff ff 48 8b 85 b0 fd ff
[  669.685848] RSP: 0018:ffffb80fc1a978d0 EFLAGS: 00010002
[  669.685849] RAX: 0000000000000002 RBX: ffff9454b5d54c00 RCX: ffff9455ec2c6170
[  669.685850] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9455eaba6158
[  669.685851] RBP: ffffb80fc1a97b80 R08: 0000000000000005 R09: 0000000000000000
[  669.685851] R10: ffffb80fc1a97838 R11: ffffb80fc1a9783c R12: 0000000000000206
[  669.685852] R13: ffff9455ec2c6000 R14: ffff94559d443800 R15: ffff9455eda20000
[  669.685853] FS:  00007fc6a5a21f00(0000) GS:ffff9455fe980000(0000) knlGS:0000000000000000
[  669.685854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  669.685855] CR2: 00007fc6a5991678 CR3: 00000007f0390000 CR4: 0000000000340ee0
[  669.685856] Call Trace:
[  669.685864]  ? __irq_work_queue_local+0x50/0x60
[  669.685872]  ? commit_tail+0x94/0x110 [drm_kms_helper]
[  669.685878]  commit_tail+0x94/0x110 [drm_kms_helper]
[  669.685884]  drm_atomic_helper_commit+0xb8/0x130 [drm_kms_helper]
[  669.685889]  drm_atomic_helper_set_config+0x79/0x90 [drm_kms_helper]
[  669.685902]  drm_mode_setcrtc+0x194/0x6a0 [drm]
[  669.685956]  ? amdgpu_cs_wait_ioctl+0xeb/0x160 [amdgpu]
[  669.685966]  ? drm_mode_getcrtc+0x180/0x180 [drm]
[  669.685976]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[  669.685986]  drm_ioctl+0x208/0x390 [drm]
[  669.685995]  ? drm_mode_getcrtc+0x180/0x180 [drm]
[  669.686044]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  669.686048]  do_vfs_ioctl+0x40e/0x670
[  669.686050]  ksys_ioctl+0x5e/0x90
[  669.686052]  __x64_sys_ioctl+0x16/0x20
[  669.686055]  do_syscall_64+0x52/0x160
[  669.686058]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  669.686060] RIP: 0033:0x7fc6a5f6a5b7
[  669.686061] Code: 00 00 90 48 8b 05 d9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 78 0c 00 f7 d8 64 89 01 48
[  669.686062] RSP: 002b:00007ffd36fb37a8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
[  669.686063] RAX: ffffffffffffffda RBX: 00007ffd36fb37e0 RCX: 00007fc6a5f6a5b7
[  669.686064] RDX: 00007ffd36fb37e0 RSI: 00000000c06864a2 RDI: 000000000000000d
[  669.686064] RBP: 00000000c06864a2 R08: 0000000000000000 R09: 000055c668ad0740
[  669.686065] R10: 0000000000000000 R11: 0000000000003246 R12: 0000000000000000
[  669.686065] R13: 000000000000000d R14: 000055c668a607d0 R15: 0000000000000000
[  669.686067] ---[ end trace 47feccd771299f6b ]---
[  669.686082] ------------[ cut here ]------------
[  669.686158] WARNING: CPU: 6 PID: 1018 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5458 amdgpu_dm_atomic_commit_tail+0x1c5f/0x1d70 [amdgpu]
[  669.686158] Modules linked in: rfcomm(E) nf_tables(E) nfnetlink(E) bnep(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) btusb(E) btrtl(E) snd_hda_codec_realtek(E) btbcm(E) crc32_pclmul(E) btintel(E) iwlmvm(E) snd_hda_codec_generic(E) bluetooth(E) ghash_clmulni_intel(E) ledtrig_audio(E) mac80211(E) libarc4(E) snd_hda_codec_hdmi(E) uvcvideo(E) snd_hda_intel(E) videobuf2_vmalloc(E) snd_usb_audio(E) snd_intel_nhlt(E) videobuf2_memops(E) drbg(E) snd_hda_codec(E) videobuf2_v4l2(E) snd_usbmidi_lib(E) iwlwifi(E) nls_ascii(E) snd_hda_core(E) snd_rawmidi(E) videobuf2_common(E) snd_seq_device(E) snd_hwdep(E) efi_pstore(E) nls_cp437(E) ansi_cprng(E) snd_pcm(E) videodev(E) sp5100_tco(E) aesni_intel(E) cfg80211(E) vfat(E) ecdh_generic(E) crypto_simd(E) ecc(E) snd_timer(E) fat(E) ccp(E) snd(E) cryptd(E) mc(E) glue_helper(E) crc16(E) wmi_bmof(E) pcspkr(E) efivars(E) k10temp(E) watchdog(E) sg(E) rfkill(E) soundcore(E) rng_core(E) evdev(E) acpi_cpufreq(E) nct6775(E) hwmon_vid(E)
[  669.686175]  parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) amdgpu(E) gpu_sched(E) mxm_wmi(E) ahci(E) ttm(E) libahci(E) drm_kms_helper(E) xhci_pci(E) crc32c_intel(E) xhci_hcd(E) i2c_piix4(E) libata(E) drm(E) igb(E) dca(E) mfd_core(E) ptp(E) scsi_mod(E) usbcore(E) pps_core(E) i2c_algo_bit(E) nvme(E) nvme_core(E) wmi(E) button(E)
[  669.686187] CPU: 6 PID: 1018 Comm: Xorg Tainted: G        W   E     5.4.0-rc7 #31
[  669.686187] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.50 11/02/2019
[  669.686258] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x1c5f/0x1d70 [amdgpu]
[  669.686259] Code: 48 c7 c2 60 d6 a2 c0 bf 02 00 00 00 48 c7 c6 80 f8 a9 c0 e8 e3 7d bb ff 49 8b 47 08 e9 31 e5 ff ff 0f 0b e9 b4 ec ff ff 0f 0b <0f> 0b e9 cb ec ff ff 48 8b 85 b0 fd ff ff 48 8d 8d 18 fe ff ff 48
[  669.686259] RSP: 0018:ffffb80fc1a978d0 EFLAGS: 00010082
[  669.686260] RAX: 0000000000000002 RBX: ffff9454b5d54c00 RCX: ffff9455ec2c6170
[  669.686261] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9455eaba6158
[  669.686261] RBP: ffffb80fc1a97b80 R08: 0000000000000005 R09: 0000000000000000
[  669.686262] R10: ffffb80fc1a97838 R11: ffffb80fc1a9783c R12: 0000000000000206
[  669.686263] R13: ffff9455ec2c6000 R14: ffff94559d443800 R15: ffff9455eda20000
[  669.686264] FS:  00007fc6a5a21f00(0000) GS:ffff9455fe980000(0000) knlGS:0000000000000000
[  669.686264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  669.686265] CR2: 00007fc6a5991678 CR3: 00000007f0390000 CR4: 0000000000340ee0
[  669.686266] Call Trace:
[  669.686270]  ? __irq_work_queue_local+0x50/0x60
[  669.686277]  ? commit_tail+0x94/0x110 [drm_kms_helper]
[  669.686282]  commit_tail+0x94/0x110 [drm_kms_helper]
[  669.686288]  drm_atomic_helper_commit+0xb8/0x130 [drm_kms_helper]
[  669.686293]  drm_atomic_helper_set_config+0x79/0x90 [drm_kms_helper]
[  669.686304]  drm_mode_setcrtc+0x194/0x6a0 [drm]
[  669.686357]  ? amdgpu_cs_wait_ioctl+0xeb/0x160 [amdgpu]
[  669.686367]  ? drm_mode_getcrtc+0x180/0x180 [drm]
[  669.686377]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[  669.686386]  drm_ioctl+0x208/0x390 [drm]
[  669.686396]  ? drm_mode_getcrtc+0x180/0x180 [drm]
[  669.686445]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  669.686447]  do_vfs_ioctl+0x40e/0x670
[  669.686449]  ksys_ioctl+0x5e/0x90
[  669.686451]  __x64_sys_ioctl+0x16/0x20
[  669.686453]  do_syscall_64+0x52/0x160
[  669.686454]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  669.686455] RIP: 0033:0x7fc6a5f6a5b7
[  669.686457] Code: 00 00 90 48 8b 05 d9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 78 0c 00 f7 d8 64 89 01 48
[  669.686457] RSP: 002b:00007ffd36fb37a8 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
[  669.686458] RAX: ffffffffffffffda RBX: 00007ffd36fb37e0 RCX: 00007fc6a5f6a5b7
[  669.686459] RDX: 00007ffd36fb37e0 RSI: 00000000c06864a2 RDI: 000000000000000d
[  669.686459] RBP: 00000000c06864a2 R08: 0000000000000000 R09: 000055c668ad0740
[  669.686460] R10: 0000000000000000 R11: 0000000000003246 R12: 0000000000000000
[  669.686461] R13: 000000000000000d R14: 000055c668a607d0 R15: 0000000000000000
[  669.686462] ---[ end trace 47feccd771299f6c ]---
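
For anyone comparing their own dmesg output against the log above, here is a rough sketch of a filter for the flip_done timeout lines. It is illustrative only: the regular expression is inferred from the two *ERROR* lines in this log and may not match the message format of other kernel versions, and the function name is hypothetical.

```python
import re

# Matches lines such as:
# [  659.445501] [drm:...] *ERROR* [CRTC:62:crtc-0] flip_done timed out
FLIP_DONE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\*\s"
    r"\[(?P<kind>CRTC|PLANE):(?P<id>\d+):(?P<name>[^\]]+)\]\s"
    r"flip_done timed out"
)


def find_flip_done_timeouts(log_text):
    """Return (timestamp, object kind, object id, object name) per timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = FLIP_DONE.match(line.strip())
        if match:
            hits.append(
                (
                    float(match["ts"]),
                    match["kind"],
                    int(match["id"]),
                    match["name"],
                )
            )
    return hits
```

Feeding it the saved output of dmesg returns one tuple per timed-out CRTC or plane, which makes it easy to see whether the timeouts precede the amdgpu_dm warnings as they do here.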
Comment 1 rLy 2019-11-14 17:28:03 UTC
I experience the same in CS:GO and Shadow of the Tomb Raider. I noticed that the crash only happens when I move the mouse, and it seems related to the mouse pointer. For example, starting a CS:GO game using console commands, which avoids the system pointer, doesn't result in a crash, but as soon as I open the buy menu or bring up the Steam overlay, both of which show the system pointer, it crashes. Neither CS:GO nor SOTTR has in-game pointers. I'm not sure how relevant the pointer type is, because other games I tested on this basis include some exceptions.
 
GTA5 - in-game pointer - no crash
Europa Universalis 4 - in-game pointer - no crash
Train Simulator 2020 - no in-game pointer - crash

Exceptions:
Kerbal Space Program - in-game pointer - crash
Oxygen Not Included - no in-game pointer - no crash

A workaround that worked for me is setting Option "SWCursor" "True" in the Xorg configuration.
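
A rough sketch of what that could look like, written as a small Python helper that renders the snippet. The Device section layout, the "amdgpu" Xorg driver name, the identifier and the idea of saving the result under /etc/X11/xorg.conf.d/ are assumptions for illustration; only the Option line is taken from the workaround above.

```python
def render_swcursor_snippet(identifier="AMD"):
    """Return an Xorg Device section that enables the software cursor.

    The identifier is an arbitrary label; Driver "amdgpu" assumes the
    xf86-video-amdgpu DDX is the Xorg driver in use.
    """
    return "\n".join(
        [
            'Section "Device"',
            f'    Identifier "{identifier}"',
            '    Driver "amdgpu"',
            '    Option "SWCursor" "True"',
            "EndSection",
            "",
        ]
    )


if __name__ == "__main__":
    # Print the snippet so it can be reviewed before saving it anywhere.
    print(render_swcursor_snippet())
```

Xorg needs to be restarted for a change like this to take effect.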

Specs:
GPU Sapphire Pulse RX 5700 XT
Archlinux with mesa-git repo
DE: KDE
kernel 5.4-rc7
mesa-git 1:20.0.0_devel.117467.9e440b8d0b9-1
llvm-git 10.0.0_r331530.6ef63638cb8-1
Comment 2 Shmerl 2019-11-14 17:38:21 UTC
I'll give it a test, thanks. For reference, Pathfinder: Kingmaker is a Unity game and uses a custom in-game cursor.
Comment 3 Pierre-Eric Pelloux-Prayer 2019-11-14 17:46:52 UTC
This bug looks similar to this one: https://bugzilla.kernel.org/show_bug.cgi?id=205169

2 possible workarounds to test:
- do not run the game in fullscreen
- revert this kernel commit: https://bugzilla.kernel.org/show_bug.cgi?id=205169#c10
Comment 4 Shmerl 2019-11-14 18:12:12 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #3)
> - revert this kernel commit:
> https://bugzilla.kernel.org/show_bug.cgi?id=205169#c10

For reference, this is the commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=617089d5837a

Thanks, I'll give reverting it a try later today. So that fix for the infinite loop introduced a regression? Is there a better way that prevents both?
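
For anyone else who wants to test the revert, a minimal sketch, assuming a local git checkout of the kernel that contains the commit. The kernel_tree path and the helper names are hypothetical; the commit ID is the one from the link above.

```python
import subprocess

COMMIT = "617089d5837a"  # commit ID from the git.kernel.org link above


def revert_command(commit=COMMIT):
    """Build the git command that reverts the given commit."""
    return ["git", "revert", "--no-edit", commit]


def revert_in_tree(kernel_tree):
    """Run the revert inside a local kernel checkout.

    kernel_tree is a placeholder path; check=True raises if git reports
    a conflict or cannot find the commit.
    """
    subprocess.run(revert_command(), cwd=kernel_tree, check=True)
```

The kernel still has to be rebuilt and booted afterwards before the game can be retested.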
Comment 5 Shmerl 2019-11-15 01:18:06 UTC
Just tested it and can confirm that reverting that commit indeed prevents the hang in Pathfinder: Kingmaker!
Comment 6 Alex Deucher 2019-11-15 16:04:27 UTC
Created attachment 145971
possible fix

Does this patch help?
Comment 7 Shmerl 2019-11-15 19:05:59 UTC
(In reply to Alex Deucher from comment #6)
> Created attachment 145971
> possible fix
> 
> Does this patch help?

Just applied that patch on top of the latest 5.4-rc7+ and tested it. It prevents the hang in Pathfinder: Kingmaker! Thanks!
Comment 8 rLy 2019-11-15 21:22:03 UTC
(In reply to Alex Deucher from comment #6)
> Created attachment 145971
> possible fix
> 
> Does this patch help?

I tried it as well on 5.4-rc7, and it fixed every game I mentioned (CS:GO, SOTTR, TS2020, KSP).
Comment 9 Jan Kowalski 2019-11-16 11:39:43 UTC
I hit this bug on every kernel I tested: 5.3.8, 5.3.9, 5.4.0-rc6 and 5.4.0-rc7. The hangs occurred when the cursor was moved after a period of inactivity (10-60 seconds depending on the kernel version; 5.4.0-rc6 was the worst), either in Firefox while watching fullscreen video (except on 5.4.0-rc7) or in games. The bug was accompanied by another one, a sluggish cursor: on the 5.3 line it comes with screen flickering at the cursor position; on the 5.4 line it is barely visible, occurs every few seconds without screen flickering, and the cursor feels inaccurate.

I applied this patch https://bugs.freedesktop.org/attachment.cgi?id=145971 on top of 5.4.0-rc7 and the hangs are gone (thanks), but the laggy cursor is still there.

Sapphire Pulse RX 5700 XT
Dell U2311Hb, 1920x1080 60.00Hz
Kernel: 5.4.0-rc7
Mesa: 19.3.0-rc2
llvm: 9.0.0
Comment 10 Martin Peres 2019-11-19 10:01:42 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/955.

