Bug 111804 - Annoying GPU stucks are continued on Vega 20 with Kernel 5.4 + mesa 9.2.0 RC4 + llvm 9.0.0 [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Assignee: Default DRI bug account
Reported: 2019-09-24 18:53 UTC by mikhail.v.gavrilov
Modified: 2019-11-19 09:53 UTC

Attachments
dmesg (190.08 KB, text/plain)
2019-09-24 18:53 UTC, mikhail.v.gavrilov
./umr -O halt_waves -wa (242 bytes, text/plain)
2019-09-24 18:54 UTC, mikhail.v.gavrilov
./umr -R gfx[.] (1.53 KB, text/plain)
2019-09-24 18:55 UTC, mikhail.v.gavrilov
./umr -O many,bits -r *.*.mmGRBM_STATUS* (8.92 KB, text/plain)
2019-09-24 18:55 UTC, mikhail.v.gavrilov
./umr -O many,bits -r *.*.mmCP_EOP_* (1.73 KB, text/plain)
2019-09-24 18:56 UTC, mikhail.v.gavrilov
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP (281 bytes, text/plain)
2019-09-24 18:57 UTC, mikhail.v.gavrilov
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP (279 bytes, text/plain)
2019-09-24 18:57 UTC, mikhail.v.gavrilov

Description mikhail.v.gavrilov 2019-09-24 18:53:35 UTC
Created attachment 145494 [details]
dmesg

Ironically, while I was uploading the logs for bug report [1], another Vega 20 GPU, in the machine I was using to dump those logs, also hung, but with a different error. No games were running there, only the terminal and Google Chrome.


[51444.693417] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[51447.765592] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[51577.192782] sysrq: Show Blocked State
[51577.192869]   task                        PC stack   pid father
[51613.081407] sysrq: Show Blocked State
[51613.081417]   task                        PC stack   pid father
[51614.773136] perf: interrupt took too long (7178937 > 6588120), lowering kernel.perf_event_max_sample_rate to 1000
[51621.729405] snd_hda_intel 0000:0b:00.1: Refused to change power state, currently in D0
[51626.927361] snd_hda_intel 0000:0b:00.1: Refused to change power state, currently in D0
[51747.797386] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[51879.183299] ------------[ cut here ]------------
[51879.183498] WARNING: CPU: 5 PID: 1938 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5851 amdgpu_dm_atomic_commit_tail.cold+0x1f/0xde [amdgpu]
[51879.183502] Modules linked in: uinput rfcomm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic rtwpci kvm ledtrig_audio snd_hda_codec_hdmi rtw88 snd_hda_intel snd_intel_nhlt irqbypass mac80211 snd_hda_codec btusb btrtl snd_hda_core btbcm btintel crct10dif_pclmul snd_hwdep bluetooth snd_seq joydev eeepc_wmi xpad crc32_pclmul ff_memless snd_seq_device asus_wmi snd_pcm cfg80211 sparse_keymap ecdh_generic ghash_clmulni_intel snd_timer ecc video sp5100_tco pcspkr wmi_bmof snd k10temp i2c_piix4
[51879.183547]  rfkill ccp libarc4 soundcore gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables xfs libcrc32c amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper uas drm crc32c_intel igb usb_storage nvme dca i2c_algo_bit nvme_core wmi pinctrl_amd fuse
[51879.183571] CPU: 5 PID: 1938 Comm: gnome-shell Not tainted 5.4.0-0.rc0.git4.1.fc32.x86_64 #1
[51879.183575] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2703 08/20/2019
[51879.183677] RIP: 0010:amdgpu_dm_atomic_commit_tail.cold+0x1f/0xde [amdgpu]
[51879.183682] Code: e0 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 78 a1 a3 c0 e8 11 98 80 c5 0f 0b e9 6c 21 ee ff 48 c7 c7 78 a1 a3 c0 e8 fe 97 80 c5 <0f> 0b e9 c1 12 ee ff 48 c7 c7 78 a1 a3 c0 e8 eb 97 80 c5 0f 0b e9
[51879.183686] RSP: 0018:ffffa3f7038678c0 EFLAGS: 00010046
[51879.183691] RAX: 0000000000000024 RBX: ffff965f2b33a1f8 RCX: 0000000000000000
[51879.183694] RDX: 0000000000000000 RSI: ffff965f3abd9e48 RDI: ffff965f3abd9e48
[51879.183697] RBP: ffffa3f703867b70 R08: ffff965f3abd9e48 R09: 0000000000000000
[51879.183701] R10: 0000000000000001 R11: ffff965ecec78d08 R12: 0000000000000286
[51879.183704] R13: ffff965f2b33a000 R14: ffff9659547d4400 R15: ffff965f21740000
[51879.183708] FS:  00007faa412c5d00(0000) GS:ffff965f3aa00000(0000) knlGS:0000000000000000
[51879.183711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[51879.183715] CR2: 000062d003847000 CR3: 000000078ed7c000 CR4: 00000000003406e0
[51879.183718] Call Trace:
[51879.183775]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[51879.183786]  commit_tail+0x3c/0x70 [drm_kms_helper]
[51879.183797]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[51879.183818]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[51879.183839]  set_property_atomic+0xcc/0x140 [drm]
[51879.183870]  drm_mode_obj_set_property_ioctl+0xcb/0x1c0 [drm]
[51879.183890]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[51879.183906]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[51879.183924]  drm_ioctl+0x208/0x390 [drm]
[51879.183944]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[51879.183951]  ? sched_clock_cpu+0x94/0xc0
[51879.183960]  ? lockdep_hardirqs_on+0xf0/0x180
[51879.184028]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[51879.184036]  do_vfs_ioctl+0x411/0x750
[51879.184048]  ksys_ioctl+0x5e/0x90
[51879.184055]  __x64_sys_ioctl+0x16/0x20
[51879.184060]  do_syscall_64+0x5c/0xb0
[51879.184066]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[51879.184070] RIP: 0033:0x7faa44ed527b
[51879.184075] Code: 0f 1e fa 48 8b 05 0d 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 9b 0c 00 f7 d8 64 89 01 48
[51879.184079] RSP: 002b:00007ffd601d6df8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[51879.184083] RAX: ffffffffffffffda RBX: 00007ffd601d6e30 RCX: 00007faa44ed527b
[51879.184086] RDX: 00007ffd601d6e30 RSI: 00000000c01864ba RDI: 0000000000000009
[51879.184089] RBP: 00000000c01864ba R08: 0000000000000000 R09: 00000000c0c0c0c0
[51879.184093] R10: 00007faa44f9f9e0 R11: 0000000000000246 R12: 000055fc4a875460
[51879.184096] R13: 0000000000000009 R14: 0000000000000002 R15: 0000000000000000
[51879.184110] irq event stamp: 149948702
[51879.184115] hardirqs last  enabled at (149948701): [<ffffffff86b4630b>] _raw_spin_unlock_irqrestore+0x4b/0x60
[51879.184120] hardirqs last disabled at (149948702): [<ffffffff86b46ab3>] _raw_spin_lock_irqsave+0x23/0x83
[51879.184124] softirqs last  enabled at (149948632): [<ffffffff86e0035d>] __do_softirq+0x35d/0x45d
[51879.184130] softirqs last disabled at (149948625): [<ffffffff860f0787>] irq_exit+0xf7/0x100
[51879.184133] ---[ end trace d718e3c1cb156c2c ]---


[1] https://bugs.freedesktop.org/show_bug.cgi?id=111803
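
For anyone triaging similar logs, the `flip_done timed out` lines above can be pulled out of a saved dmesg dump with a short script. This is a minimal sketch, not part of the original report: the function name `find_flip_timeouts` and the returned field names are hypothetical, and the pattern only covers the line shape seen in this report (plain dmesg output, with or without the leading timestamp, not journalctl-prefixed lines).

```python
import re

# Matches lines such as:
#   [51444.693417] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
# The leading "[seconds.micros]" timestamp is optional.
FLIP_DONE_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s+)?"
    r"\[drm:(?P<func>\w+) \[drm_kms_helper\]\] \*ERROR\* "
    r"\[(?P<kind>CRTC|PLANE):(?P<obj_id>\d+):(?P<name>[\w-]+)\] "
    r"flip_done timed out"
)


def find_flip_timeouts(lines):
    """Return one dict per 'flip_done timed out' line found in `lines`."""
    events = []
    for line in lines:
        match = FLIP_DONE_RE.match(line.strip())
        if match is None:
            continue  # not a flip_done timeout line
        ts = match.group("ts")
        events.append({
            "ts": float(ts) if ts is not None else None,  # seconds since boot
            "func": match.group("func"),       # helper that reported the timeout
            "kind": match.group("kind"),       # "CRTC" or "PLANE"
            "obj_id": int(match.group("obj_id")),
            "name": match.group("name"),       # e.g. "crtc-0"
        })
    return events
```

Calling `find_flip_timeouts(open("dmesg.txt"))` on a saved dump (the file name is only an example) returns the events in log order, which makes it easy to see how far apart the CRTC and PLANE timeouts are.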
Comment 1 mikhail.v.gavrilov 2019-09-24 18:54:57 UTC
Created attachment 145495 [details]
./umr -O halt_waves -wa
Comment 2 mikhail.v.gavrilov 2019-09-24 18:55:17 UTC
Created attachment 145496 [details]
./umr -R gfx[.]
Comment 3 mikhail.v.gavrilov 2019-09-24 18:55:37 UTC
Created attachment 145497 [details]
./umr -O many,bits -r *.*.mmGRBM_STATUS*
Comment 4 mikhail.v.gavrilov 2019-09-24 18:56:24 UTC
Created attachment 145498 [details]
./umr -O many,bits -r *.*.mmCP_EOP_*
Comment 5 mikhail.v.gavrilov 2019-09-24 18:57:08 UTC
Created attachment 145499 [details]
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP
Comment 6 mikhail.v.gavrilov 2019-09-24 18:57:32 UTC
Created attachment 145500 [details]
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP
Comment 7 Chernovsky Oleg 2019-09-24 20:48:20 UTC
I can confirm this: I have the same issue with a Vega 64 while gaming (both native and Wine + DXVK). Surprisingly, the dmesg stack mentions the Slack Electron app, which was indeed running in the background.

dmesg stack:
[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=589680, emitted seq=589681
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=5916, emitted seq=5917
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process slack pid 2028 thread slack:cs0 pid 2032
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process slack pid 2028 thread slack:cs0 pid 2032
amdgpu 0000:0d:00.0: GPU reset begin!
amdgpu 0000:0d:00.0: GPU reset begin!
[drm] Bailing on TDR for s_job:8f401, as another already in progress
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
------------[ cut here ]------------
WARNING: CPU: 9 PID: 937 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5813 amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Modules linked in: cmac rfcomm fuse bridge stp llc uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common snd_usb_audio videodev ...
 aesni_intel libahci libata aes_x86_64 glue_helper crypto_simd cryptd xhci_pci scsi_mod xhci_hcd
CPU: 9 PID: 937 Comm: Xorg Not tainted 5.3.0-arch1-1-ARCH #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Gaming K4, BIOS P5.50 08/04/2019
RIP: 0010:amdgpu_dm_atomic_commit_tail.cold+0x82/0xed [amdgpu]
Code: c7 c7 58 4d 0a c1 e8 57 22 f1 db 0f 0b 41 83 7c 24 08 00 0f 85 a0 ff f1 ff e9 bb ff f1 ff 48 c7 c7 58 4d 0a c1 e8 38 22 f1 db <0f> 0b e9 3a f5 f ...
RSP: 0018:ffffa20100cc78a0 EFLAGS: 00010046
RAX: 0000000000000024 RBX: ffff92f34e662000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000086 RDI: 00000000ffffffff
RBP: ffffa20100cc7bc0 R08: 00000000000004dc R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000286
R13: ffff92f24c449800 R14: ffff92f3769a0000 R15: ffff92f22460af00
FS:  00007f45b11eedc0(0000) GS:ffff92f37e840000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5b665c8000 CR3: 00000007c99e6000 CR4: 00000000003406e0
Call Trace:
 ? commit_tail+0x3c/0x70 [drm_kms_helper]
 commit_tail+0x3c/0x70 [drm_kms_helper]
 drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
 drm_atomic_helper_legacy_gamma_set+0x11b/0x170 [drm_kms_helper]
 drm_mode_gamma_set_ioctl+0x1a9/0x210 [drm]
 ? drm_color_lut_check+0xb0/0xb0 [drm]
 drm_ioctl_kernel+0xb8/0x100 [drm]
 drm_ioctl+0x23d/0x3d0 [drm]
 ? drm_color_lut_check+0xb0/0xb0 [drm]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x43d/0x6c0
 ? syscall_trace_enter+0x1f2/0x2e0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x5f/0x1c0
 ? prepare_exit_to_usermode+0x85/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f45b242721b

System info:

System:    Host: house-of-maker Kernel: 5.3.0-arch1-1-ARCH x86_64 bits: 64 compiler: gcc v: 9.1.0 Desktop: KDE Plasma 5.16.5 
Machine:   Type: Desktop Mobo: ASRock model: X370 Gaming K4 serial: <root required> UEFI: American Megatrends v: P5.50 
           date: 08/04/2019 
CPU:       Topology: 8-Core model: AMD Ryzen 7 1700X bits: 64 type: MT MCP arch: Zen rev: 1 L2 cache: 4096 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 108622 
           Speed: 2513 MHz min/max: 2200/3400 MHz Core speeds (MHz): 1: 2440 2: 2570 3: 1725 4: 2371 5: 1712 6: 1740 7: 1711 
           8: 2396 9: 1711 10: 2367 11: 1862 12: 1711 13: 1754 14: 2398 15: 2682 16: 2368 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] vendor: Sapphire Limited 
           driver: amdgpu v: kernel bus ID: 0d:00.0 
           Display: x11 server: X.Org 1.20.5 driver: modesetting unloaded: fbdev,vesa resolution: 2560x1440~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.33.0 5.3.0-arch1-1-ARCH LLVM 8.0.1) v: 4.5 Mesa 19.1.7 
           direct render: Yes
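
The `ring ... timeout, signaled seq=..., emitted seq=...` lines in the log above can be summarized per ring with a small script. This is a sketch under stated assumptions: `parse_ring_timeouts` and the `pending` field are hypothetical names, and reading `emitted - signaled` as the number of submitted-but-unsignaled fences is an interpretation of the message, not something confirmed in this report. Lines of the form `ring gfx timeout, but soft recovered` carry no sequence numbers and are skipped.

```python
import re

# Matches the sequence-number form of the amdgpu_job_timedout message, e.g.:
#   ring sdma0 timeout, signaled seq=589680, emitted seq=589681
RING_TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return one dict per ring timeout line that carries sequence numbers."""
    results = []
    for line in lines:
        match = RING_TIMEOUT_RE.search(line)
        if match is None:
            continue  # other messages, including "but soft recovered"
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: the gap is the count of fences still outstanding.
            "pending": emitted - signaled,
        })
    return results
```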
Comment 8 jamespharvey20 2019-10-12 23:42:12 UTC
Just ran into this with the Vega 64.  No games open.  Only KDE, suckless terminal, firefox, and remote-viewer.

Thankfully, I haven't noticed any negative effects.  I'm not even sure I need to reboot, and only saw this while looking at journalctl for another reason.

Currently running 5.3.0, mesa 19.2.0, and llvm 8.0.1.  Going to be upgrading to 5.3.5, 19.2.1, and 9.0.0 soon, but haven't done so yet.



Oct 11 00:13:53 newKvm kernel: [drm] amdgpu_dm_irq_schedule_work FAILED src 11
(Yes, nothing else was logged with this message, which appeared almost 2 days before this problem.)

Oct 12 19:34:58 newKvm kernel: kworker/u65:4   D    0 2652517      2 0x80004080
Oct 12 19:34:58 newKvm kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Oct 12 19:34:58 newKvm kernel: Call Trace:
Oct 12 19:34:58 newKvm kernel:  ? __schedule+0x27f/0x6d0
Oct 12 19:34:58 newKvm kernel:  schedule+0x43/0xd0
Oct 12 19:34:58 newKvm kernel:  schedule_timeout+0x1cf/0x3d0
Oct 12 19:34:58 newKvm kernel:  ? collect_expired_timers+0xb0/0xb0
Oct 12 19:34:58 newKvm kernel:  wait_for_common+0xeb/0x190
Oct 12 19:34:58 newKvm kernel:  ? wake_up_q+0x60/0x60
Oct 12 19:34:58 newKvm kernel:  drm_atomic_helper_wait_for_flip_done+0x5f/0xb0 [drm_kms_helper]
Oct 12 19:34:58 newKvm kernel:  amdgpu_dm_atomic_commit_tail+0x1898/0x1d00 [amdgpu]
Oct 12 19:34:58 newKvm kernel:  ? commit_tail+0x3c/0x70 [drm_kms_helper]
Oct 12 19:34:58 newKvm kernel:  commit_tail+0x3c/0x70 [drm_kms_helper]
Oct 12 19:34:58 newKvm kernel:  process_one_work+0x1d1/0x3a0
Oct 12 19:34:58 newKvm kernel:  worker_thread+0x4a/0x3d0
Oct 12 19:34:58 newKvm kernel:  kthread+0xfb/0x130
Oct 12 19:34:58 newKvm kernel:  ? process_one_work+0x3a0/0x3a0
Oct 12 19:34:58 newKvm kernel:  ? kthread_park+0x80/0x80
Oct 12 19:34:58 newKvm kernel:  ret_from_fork+0x35/0x40
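
When comparing traces like the one above across reports, it helps to split the frames into those the kernel marks with `?` (addresses found on the stack that the unwinder could not confirm as part of the call chain) and the rest. The following is a minimal sketch: `parse_call_trace` and its field names are hypothetical, and it assumes each frame has the `symbol+0xOFFSET/0xSIZE [module]` shape seen in this report. It starts collecting after a `Call Trace:` line and stops at the first line that is not a frame.

```python
import re

# One stack frame, optionally prefixed with "? " and followed by "[module]".
FRAME_RE = re.compile(
    r"(?P<unreliable>\? )?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_call_trace(lines):
    """Return the frames of the first 'Call Trace:' block found in `lines`."""
    frames = []
    in_trace = False
    for line in lines:
        if not in_trace:
            in_trace = "Call Trace:" in line
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # first non-frame line ends the trace
        frames.append({
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            "module": match.group("module"),  # None for built-in symbols
            "reliable": match.group("unreliable") is None,
        })
    return frames


def reliable_frames(frames):
    """Keep only frames that were not marked with '?'."""
    return [frame for frame in frames if frame["reliable"]]
```

Applied to the trace above, the reliable frames show a worker blocked in `drm_atomic_helper_wait_for_flip_done`, called from `amdgpu_dm_atomic_commit_tail`.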
Comment 9 Martin Peres 2019-11-19 09:53:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/917.

