On a regular (almost daily) basis, my two screens turn black, and no mouse or keyboard input brings them back to life. The system itself seems to keep running properly: I can access it remotely and shut it down cleanly. The system is an HP Z620 workstation, the OS is a current Arch Linux (rolling release with up-to-date drivers, kernel 4.18.6), the card is an RX Vega 64, and the two screens are connected via DisplayPort. The following error is the only one that appears each time the screens give up:

Sep 14 10:16:02 vmserver kernel: [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:636
Sep 14 10:16:02 vmserver kernel: WARNING: CPU: 10 PID: 785 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xe7/0x160 [amdgpu]
Sep 14 10:16:02 vmserver kernel: Modules linked in: nls_utf8 ntfs fuse vhost_net vhost tap tun devlink nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ses enclosure bridge stp llc joydev input_leds led_class iTCO_wdt iTCO_vendor_support me>
Sep 14 10:16:02 vmserver kernel: hid_roccat snd_seq_device hid_roccat_common uas snd_hda_codec media snd_hda_core lpc_ich snd_hwdep usb_storage snd_pcm mei_me mousedev snd_timer e1000e snd ioatdma mei soundcore wmi dca pcc_cpufreq evdev mac_hid ip_tabl>
Sep 14 10:16:02 vmserver kernel: CPU: 10 PID: 785 Comm: Xorg Not tainted 4.18.6-arch1-1-ARCH #1
Sep 14 10:16:02 vmserver kernel: Hardware name: Hewlett-Packard HP Z620 Workstation/158A, BIOS J61 v03.91 10/17/2016
Sep 14 10:16:02 vmserver kernel: RIP: 0010:generic_reg_wait+0xe7/0x160 [amdgpu]
Sep 14 10:16:02 vmserver kernel: Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 18 8f 79 c0 e8 04 d0 d2 ff 83 7d 20 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f
Sep 14 10:16:02 vmserver kernel: RSP: 0018:ffff97fe1038f8e0 EFLAGS: 00010297
Sep 14 10:16:02 vmserver kernel: RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000001
Sep 14 10:16:02 vmserver kernel: RDX: 0000000000000000 RSI: ffffffffaa08051e RDI: 00000000ffffffff
Sep 14 10:16:02 vmserver kernel: RBP: ffff953ca6054f00 R08: ffffffffa96ddf70 R09: 0000000000000002
Sep 14 10:16:02 vmserver kernel: R10: 0000000000000004 R11: ffffffffaa804f2d R12: 0000000000000dad
Sep 14 10:16:02 vmserver kernel: R13: 00000000000035b0 R14: 0000000000000010 R15: 0000000000000001
Sep 14 10:16:02 vmserver kernel: FS: 00007f36106dee00(0000) GS:ffff954caf000000(0000) knlGS:0000000000000000
Sep 14 10:16:02 vmserver kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 10:16:02 vmserver kernel: CR2: 00007f09c946b000 CR3: 0000000ee1f72003 CR4: 00000000001626e0
Sep 14 10:16:02 vmserver kernel: Call Trace:
Sep 14 10:16:02 vmserver kernel: dce_mi_free_dmif+0xf8/0x180 [amdgpu]
Sep 14 10:16:02 vmserver kernel: dce110_reset_hw_ctx_wrap+0x141/0x1b0 [amdgpu]
Sep 14 10:16:02 vmserver kernel: dce110_apply_ctx_to_hw+0x52/0xa30 [amdgpu]
Sep 14 10:16:02 vmserver kernel: ? hwmgr_handle_task+0x6b/0xc0 [amdgpu]
Sep 14 10:16:02 vmserver kernel: ? pp_dpm_dispatch_tasks+0x41/0x60 [amdgpu]
Sep 14 10:16:02 vmserver kernel: ? amdgpu_pm_compute_clocks.part.8+0xb7/0x530 [amdgpu]
Sep 14 10:16:02 vmserver kernel: dc_commit_state+0x2d1/0x550 [amdgpu]
Sep 14 10:16:02 vmserver kernel: amdgpu_dm_atomic_commit_tail+0x37c/0xd70 [amdgpu]
Sep 14 10:16:02 vmserver kernel: ? preempt_count_add+0x68/0xa0
Sep 14 10:16:02 vmserver kernel: ? _raw_spin_lock_irq+0x1a/0x40
Sep 14 10:16:02 vmserver kernel: ? _raw_spin_unlock_irq+0x1d/0x30
Sep 14 10:16:02 vmserver kernel: ? wait_for_common+0x113/0x190
Sep 14 10:16:02 vmserver kernel: ? _raw_spin_unlock_irq+0x1d/0x30
Sep 14 10:16:02 vmserver kernel: ? wait_for_common+0x113/0x190
Sep 14 10:16:02 vmserver kernel: commit_tail+0x3d/0x70 [drm_kms_helper]
Sep 14 10:16:02 vmserver kernel: drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
Sep 14 10:16:02 vmserver kernel: drm_atomic_connector_commit_dpms+0xdb/0x100 [drm]
Sep 14 10:16:02 vmserver kernel: drm_mode_obj_set_property_ioctl+0x178/0x280 [drm]
Sep 14 10:16:02 vmserver kernel: ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
Sep 14 10:16:02 vmserver kernel: drm_mode_connector_property_set_ioctl+0x39/0x60 [drm]
Sep 14 10:16:02 vmserver kernel: drm_ioctl_kernel+0xa7/0xf0 [drm]
Sep 14 10:16:02 vmserver kernel: drm_ioctl+0x30e/0x3c0 [drm]
Sep 14 10:16:02 vmserver kernel: ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
Sep 14 10:16:02 vmserver kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Sep 14 10:16:02 vmserver kernel: do_vfs_ioctl+0xa4/0x620
Sep 14 10:16:02 vmserver kernel: ? syscall_slow_exit_work+0x19b/0x1b0
Sep 14 10:16:02 vmserver kernel: ksys_ioctl+0x60/0x90
Sep 14 10:16:02 vmserver kernel: __x64_sys_ioctl+0x16/0x20
Sep 14 10:16:02 vmserver kernel: do_syscall_64+0x5b/0x170
Sep 14 10:16:02 vmserver kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 14 10:16:02 vmserver kernel: RIP: 0033:0x7f3613ec979b
Sep 14 10:16:02 vmserver kernel: Code: 0f 1e fa 48 8b 05 c5 b6 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 b6 0c 00 f7 d8 64 89 01 48
Sep 14 10:16:02 vmserver kernel: RSP: 002b:00007ffd288c2698 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 14 10:16:02 vmserver kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f3613ec979b
Sep 14 10:16:02 vmserver kernel: RDX: 00007ffd288c26d0 RSI: 00000000c01064ab RDI: 000000000000000f
Sep 14 10:16:02 vmserver kernel: RBP: 00007ffd288c26d0 R08: 00007ffd288c2670 R09: 00007ffd288c266c
Sep 14 10:16:02 vmserver kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c01064ab
Sep 14 10:16:02 vmserver kernel: R13: 000000000000000f R14: 0000000000000000 R15: 0000000000000000
Sep 14 10:16:02 vmserver kernel: ---[ end trace d9f6af2a868d7e60 ]---

If there is any other information I should provide, please let me know.
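Since the box stays reachable over ssh, one way to timestamp each blank-screen event is to filter the kernel log stream for the REG_WAIT line above. A minimal sketch (nothing amdgpu-specific; `watch_reg_wait` is just a name chosen here):

```shell
#!/bin/sh
# Filter a kernel log stream (stdin) down to the REG_WAIT timeout lines,
# so each blank-screen event can be pinned down remotely over ssh.
# Typical use:  journalctl -kf | watch_reg_wait
#          or:  dmesg -w | watch_reg_wait
watch_reg_wait() {
    grep --line-buffered 'REG_WAIT timeout'
}
```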
Please attach the full dmesg output.
Created attachment 141560 [details] Dmesg Output
I hope it's OK if I bump this bug report. It's been almost a month since I reported it, and I haven't received any feedback. Otherwise there is nothing really new to report: the problem still exists, and the driver crashes regularly. I have noticed a possible correlation with usage: when I'm working on the computer, the problem may happen several times a day; when I'm not touching it, it may take 2-3 days for the driver to crash. Sometimes logging in through ssh and restarting the display manager (lxdm) is enough, but most of the time only a reboot solves the issue.
@Vik-T: The way you describe your symptoms makes it seem possible to me that you are experiencing the very same long-standing bug that I reported in https://bugs.freedesktop.org/show_bug.cgi?id=102322

If you want to verify whether that bug and yours are actually the same, you could try the following:

(a) Check whether you still experience your bug after disabling dynamic power management. To do this, switch to manual power management like this:

  cd /sys/class/drm/card0/device
  echo manual > power_dpm_force_performance_level
  echo 0 > pp_dpm_mclk
  echo 0 > pp_dpm_sclk

In my case, the bug does not occur while clocks are set manually. Caveat: these settings are ignored/overwritten by the amdgpu driver after each display mode change and each off/on switch of the display output or monitor, so this test is meaningful only if the manual settings are re-applied after every such mode change or switch-on. (I reported that bug in https://bugs.freedesktop.org/show_bug.cgi?id=107141)

(b) Check whether you can reproduce the symptom more quickly with a certain load pattern:

(1) Enable dynamic power management (which is also the default).
(2) Start X11, but no client (or desktop environment) that draws anything on the screen.
(3) Play an (at least 1080p) video at only 3 frames per second, e.g. via:

  mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm

This kind of load causes (at least on my system) frequent changes to the pp_dpm_mclk and pp_dpm_sclk values, and under it the system crashes after only a short while (seconds up to 15 minutes) with the symptom you described (blanked screen, system crash).
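For convenience, the sequence above can be wrapped in a small helper so it is easy to re-apply after every mode change. This is only a sketch of the steps described in (a); the card0 path is the example given there and may differ on other systems:

```shell
#!/bin/sh
# Re-apply the manual low-clock settings from step (a) above.
# $1 is the sysfs device directory; defaulting to card0 is an assumption.
set_manual_clocks() {
    card="${1:-/sys/class/drm/card0/device}"
    echo manual > "$card/power_dpm_force_performance_level"
    echo 0 > "$card/pp_dpm_mclk"
    echo 0 > "$card/pp_dpm_sclk"
}
# Usage (as root):  set_manual_clocks
#                   set_manual_clocks /sys/class/drm/card1/device
```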
@dwagner: Thanks for your comment. I could not reproduce the error with the 3 fps 1080p video on a bare X server; I let it run for 25 minutes without any issues. Besides, unlike you, I have never experienced any sort of full system crash: I can always shut down and reboot cleanly. As I mentioned, what happens, at random, is that the screens go blank and the log shows the call trace I posted in the first message. But the system has never fully crashed so far, and I can usually ssh into the box. Sometimes a restart of the display manager solves the issue, but more often a reboot is necessary.
@ Vik-T: Thanks for testing, so at least we know it's a different bug that haunts your system.
I am able to reproduce this bug in every detail on my machine. The only difference is that I am never present to directly observe the driver deadlock; it always occurs when I have left the machine idle for at least a few hours. Both tests dwagner proposed yielded negative results. I am attaching dmesg logs from the most recent instance of the problem. Please advise. I run Gentoo and can easily introduce patches into any part of the system for testing.
Created attachment 142018 [details] Trimmed dmesg logs
Created attachment 142022 [details] Full dmesg logs
I've determined that the deadlock and stack trace found in my dmesg logs are emitted precisely when I attempt to wake the machine's display from sleep by touching the keyboard or mouse, and not before. If I leave the machine on a terminal console instead of in a running X session, the display never sleeps, and the deadlock never occurs. The first instance of the deadlock on my machine occurred during a session following an upgrade of the xf86-video-amdgpu driver from version 18.0.1 to 18.1.0, and simultaneously of Mesa to a checkout of the master branch ca. commit 0d495bec25bd7584de4e988c2b4528c1996bc1d0, i.e. approximately 2018-09-26 04:16 UTC. I am now attempting to revert these upgrades one at a time to determine whether either of them is implicated.
Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring on my system.
(In reply to Matthew Vaughn from comment #11) > Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to > x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring > on my system. Can you bisect xf86-video-amdgpu?
(In reply to Matthew Vaughn from comment #11) > Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to > x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring > on my system. Premature. Does not work consistently.
Same thing on Fedora 29, kernel 4.18.16-300.fc29.x86_64, also with an RX Vega 64. One monitor, connected either via DP or mDP.
Someone might find this helpful: I managed to reduce the number of driver crashes considerably by disabling the "suspend" and "off" DPMS modes in X:

Section "ServerFlags"
    Option "SuspendTime" "0"
    Option "OffTime" "0"
EndSection

In the last 5-6 days, the driver crashed only once, and I managed to bring X back without rebooting. For me, that's a huge improvement over the situation before, when I had to reboot 2-3 times a day.
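A similar effect can be had at runtime with xset, without editing xorg.conf; unlike the ServerFlags settings this only lasts for the current session and resets when X restarts. A guarded sketch:

```shell
#!/bin/sh
# Session-only alternative to the ServerFlags approach: disable the X
# screen saver and DPMS standby/suspend/off for the running session.
# Guarded so it is a harmless no-op when no X display is reachable.
if [ -n "$DISPLAY" ] && command -v xset >/dev/null 2>&1; then
    xset s off    # never start the screen saver
    xset -dpms    # disable DPMS entirely
fi
```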
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/525.