Bug 107928

Summary: Screen regularly turns black, reboot needed
Product: DRI
Reporter: Vik-T <viktor>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: critical
Priority: medium
CC: harry.wentland, nethershaw, nicholas.kazlauskas, sunpeng.li, taijian
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (Description: Flags):
Dmesg Output: none
Trimmed dmesg logs: none
Full dmesg logs: none

Description Vik-T 2018-09-14 09:44:25 UTC
On a regular (almost daily) basis, my two screens turn black, and no mouse movement or keyboard input brings them back to life. The system itself still seems to be running properly: I can access it remotely and shut it down cleanly.

The system is an HP Z620 workstation running current Arch Linux (rolling release with up-to-date drivers, kernel 4.18.6). The card is an RX Vega 64, and the two screens are connected via DisplayPort.

The following error is the only one that occurs each time when the screens give up: 

Sep 14 10:16:02 vmserver kernel: [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:636
Sep 14 10:16:02 vmserver kernel: WARNING: CPU: 10 PID: 785 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xe7/0x160 [amdgpu]
Sep 14 10:16:02 vmserver kernel: Modules linked in: nls_utf8 ntfs fuse vhost_net vhost tap tun devlink nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ses enclosure bridge stp llc joydev input_leds led_class iTCO_wdt iTCO_vendor_support me>
Sep 14 10:16:02 vmserver kernel:  hid_roccat snd_seq_device hid_roccat_common uas snd_hda_codec media snd_hda_core lpc_ich snd_hwdep usb_storage snd_pcm mei_me mousedev snd_timer e1000e snd ioatdma mei soundcore wmi dca pcc_cpufreq evdev mac_hid ip_tabl>
Sep 14 10:16:02 vmserver kernel: CPU: 10 PID: 785 Comm: Xorg Not tainted 4.18.6-arch1-1-ARCH #1
Sep 14 10:16:02 vmserver kernel: Hardware name: Hewlett-Packard HP Z620 Workstation/158A, BIOS J61 v03.91 10/17/2016
Sep 14 10:16:02 vmserver kernel: RIP: 0010:generic_reg_wait+0xe7/0x160 [amdgpu]
Sep 14 10:16:02 vmserver kernel: Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 18 8f 79 c0 e8 04 d0 d2 ff 83 7d 20 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f 
Sep 14 10:16:02 vmserver kernel: RSP: 0018:ffff97fe1038f8e0 EFLAGS: 00010297
Sep 14 10:16:02 vmserver kernel: RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000001
Sep 14 10:16:02 vmserver kernel: RDX: 0000000000000000 RSI: ffffffffaa08051e RDI: 00000000ffffffff
Sep 14 10:16:02 vmserver kernel: RBP: ffff953ca6054f00 R08: ffffffffa96ddf70 R09: 0000000000000002
Sep 14 10:16:02 vmserver kernel: R10: 0000000000000004 R11: ffffffffaa804f2d R12: 0000000000000dad
Sep 14 10:16:02 vmserver kernel: R13: 00000000000035b0 R14: 0000000000000010 R15: 0000000000000001
Sep 14 10:16:02 vmserver kernel: FS:  00007f36106dee00(0000) GS:ffff954caf000000(0000) knlGS:0000000000000000
Sep 14 10:16:02 vmserver kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 10:16:02 vmserver kernel: CR2: 00007f09c946b000 CR3: 0000000ee1f72003 CR4: 00000000001626e0
Sep 14 10:16:02 vmserver kernel: Call Trace:
Sep 14 10:16:02 vmserver kernel:  dce_mi_free_dmif+0xf8/0x180 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  dce110_reset_hw_ctx_wrap+0x141/0x1b0 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  dce110_apply_ctx_to_hw+0x52/0xa30 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  ? hwmgr_handle_task+0x6b/0xc0 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  ? pp_dpm_dispatch_tasks+0x41/0x60 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  ? amdgpu_pm_compute_clocks.part.8+0xb7/0x530 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  dc_commit_state+0x2d1/0x550 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  amdgpu_dm_atomic_commit_tail+0x37c/0xd70 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  ? preempt_count_add+0x68/0xa0
Sep 14 10:16:02 vmserver kernel:  ? _raw_spin_lock_irq+0x1a/0x40
Sep 14 10:16:02 vmserver kernel:  ? _raw_spin_unlock_irq+0x1d/0x30
Sep 14 10:16:02 vmserver kernel:  ? wait_for_common+0x113/0x190
Sep 14 10:16:02 vmserver kernel:  ? _raw_spin_unlock_irq+0x1d/0x30
Sep 14 10:16:02 vmserver kernel:  ? wait_for_common+0x113/0x190
Sep 14 10:16:02 vmserver kernel:  commit_tail+0x3d/0x70 [drm_kms_helper]
Sep 14 10:16:02 vmserver kernel:  drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
Sep 14 10:16:02 vmserver kernel:  drm_atomic_connector_commit_dpms+0xdb/0x100 [drm]
Sep 14 10:16:02 vmserver kernel:  drm_mode_obj_set_property_ioctl+0x178/0x280 [drm]
Sep 14 10:16:02 vmserver kernel:  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
Sep 14 10:16:02 vmserver kernel:  drm_mode_connector_property_set_ioctl+0x39/0x60 [drm]
Sep 14 10:16:02 vmserver kernel:  drm_ioctl_kernel+0xa7/0xf0 [drm]
Sep 14 10:16:02 vmserver kernel:  drm_ioctl+0x30e/0x3c0 [drm]
Sep 14 10:16:02 vmserver kernel:  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
Sep 14 10:16:02 vmserver kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Sep 14 10:16:02 vmserver kernel:  do_vfs_ioctl+0xa4/0x620
Sep 14 10:16:02 vmserver kernel:  ? syscall_slow_exit_work+0x19b/0x1b0
Sep 14 10:16:02 vmserver kernel:  ksys_ioctl+0x60/0x90
Sep 14 10:16:02 vmserver kernel:  __x64_sys_ioctl+0x16/0x20
Sep 14 10:16:02 vmserver kernel:  do_syscall_64+0x5b/0x170
Sep 14 10:16:02 vmserver kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 14 10:16:02 vmserver kernel: RIP: 0033:0x7f3613ec979b
Sep 14 10:16:02 vmserver kernel: Code: 0f 1e fa 48 8b 05 c5 b6 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 b6 0c 00 f7 d8 64 89 01 48 
Sep 14 10:16:02 vmserver kernel: RSP: 002b:00007ffd288c2698 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 14 10:16:02 vmserver kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f3613ec979b
Sep 14 10:16:02 vmserver kernel: RDX: 00007ffd288c26d0 RSI: 00000000c01064ab RDI: 000000000000000f
Sep 14 10:16:02 vmserver kernel: RBP: 00007ffd288c26d0 R08: 00007ffd288c2670 R09: 00007ffd288c266c
Sep 14 10:16:02 vmserver kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c01064ab
Sep 14 10:16:02 vmserver kernel: R13: 000000000000000f R14: 0000000000000000 R15: 0000000000000000
Sep 14 10:16:02 vmserver kernel: ---[ end trace d9f6af2a868d7e60 ]---

If there is any other information I should provide, please let me know.
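
To see how often and when the timeout fires, the kernel log can be filtered for the error string. A minimal sketch, assuming a systemd journal in its default short format (as in the excerpt above); the function name reg_wait_events is made up for illustration and reads log text from stdin:

```shell
# Print the timestamp of every REG_WAIT timeout found in kernel log text
# on stdin. reg_wait_events is a hypothetical helper, not part of any tool.
reg_wait_events() {
    # Journal lines start with "Mon DD HH:MM:SS host kernel:", so the
    # first three fields form the timestamp.
    awk '/\*ERROR\* REG_WAIT timeout/ { print $1, $2, $3 }'
}

# Usage (needs permission to read the kernel journal):
#   journalctl -k --no-pager | reg_wait_events
```

Comparing these timestamps with the times the screens were woken up or switched off may help narrow down the trigger.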
Comment 1 Michel Dänzer 2018-09-14 09:51:48 UTC
Please attach the full dmesg output.
Comment 2 Vik-T 2018-09-14 10:01:44 UTC
Created attachment 141560 [details]
Dmesg Output
Comment 3 Vik-T 2018-10-07 15:48:59 UTC
I hope it's ok if I bump this bug report. It's been almost a month since I reported it and I haven't received any feedback since. 

Nothing really new I can report as such otherwise. The problem still exists, the driver crashes regularly. I noticed a certain possible correlation with usage: When I'm working on the computer, the problem may happen several times a day. When I'm not touching it, it may take up to 2-3 days for the driver to crash. 

Sometimes logging in through ssh and restarting the display manager (lxdm) is enough. But most of the time, only a reboot solves the issue.
Comment 4 dwagner 2018-10-07 21:10:16 UTC
@ Vik-T: The way you describe your symptoms makes it seem possible to me that you are experiencing the very same long-standing bug that I reported in https://bugs.freedesktop.org/show_bug.cgi?id=102322

If you want to verify if that bug and yours are actually the same, you could try the following:

(a) Check whether you experience your bug also after disabling dynamic power management. To do this, switch to manual power management like this:

> cd /sys/class/drm/card0/device
> echo manual >power_dpm_force_performance_level
> echo 0 >pp_dpm_mclk 
> echo 0 >pp_dpm_sclk
In my case, the bug does not occur while clocks are set manually. Caveat: these settings are ignored/overwritten by the amdgpu driver after each display mode change and each off/on of display output or monitor, so this test is meaningful only if the manual settings are re-applied after each such mode change or switch-on. (I reported that behaviour as https://bugs.freedesktop.org/show_bug.cgi?id=107141 )
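
Because the driver resets these values, one crude way to keep test (a) meaningful is to re-apply them periodically. A minimal sketch, assuming the card is card0 as in the commands above; the helper name apply_manual_dpm and the 5-second interval are made up for illustration:

```shell
# Write the manual power-management settings from test (a) into an amdgpu
# sysfs device directory. apply_manual_dpm is a hypothetical helper name;
# the default directory (card0) is an assumption about the system.
apply_manual_dpm() {
    dev=${1:-/sys/class/drm/card0/device}
    echo manual > "$dev/power_dpm_force_performance_level" &&
    echo 0 > "$dev/pp_dpm_mclk" &&
    echo 0 > "$dev/pp_dpm_sclk"
}

# Usage (as root), re-applying every 5 seconds:
#   while sleep 5; do apply_manual_dpm; done
```

Polling leaves a window of a few seconds after each mode change in which dynamic power management is active again, so a crash inside that window would not rule out the power-management theory.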


(b) You could check if you can reproduce the symptom more quickly with a certain load pattern: 

(1) Enable dynamic power management (which is also the default)
(2) Start X11, but not any client (or desktop environment) that draws anything on the screen
(3) Replay an (at least 1080p) video with only 3 frames per second, e.g. via:
"mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"

This kind of load causes (at least in the case of my system) frequent changes to the pp_dpm_mclk and pp_dpm_sclk values, and the system crashes after only a short while (seconds up to 15 minutes) under this kind of load, with the symptom (blanked screen, system crash) you described.
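
To check whether the clocks really do change frequently under this load, the active levels can be logged while the video plays. A minimal sketch, relying on amdgpu marking the active level in pp_dpm_sclk and pp_dpm_mclk with a trailing "*"; the helper name active_dpm_level and the card0 path are assumptions:

```shell
# Print the currently active level from a pp_dpm_sclk or pp_dpm_mclk file.
# The active level is the line whose last field is "*", e.g. "1: 991Mhz *".
# active_dpm_level is a hypothetical helper name.
active_dpm_level() {
    awk '$NF == "*" { print $1, $2 }' "$1"
}

# Usage: log both clocks once per second.
#   dev=/sys/class/drm/card0/device
#   while sleep 1; do
#       echo "$(date +%T) sclk=$(active_dpm_level "$dev/pp_dpm_sclk") mclk=$(active_dpm_level "$dev/pp_dpm_mclk")"
#   done
```

Sampling once per second will miss faster transitions, so the log gives only a lower bound on how often the levels switch.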
Comment 5 Vik-T 2018-10-07 22:16:44 UTC
@dwagner:

Thanks for your comment. I could not reproduce the error with the 3fps 1080p video on naked X. I let it run for 25 minutes without any issues. 

Besides, unlike yourself, I never experienced any sort of full system crash. I can always shut down and reboot cleanly. 

As I mentioned, what happens, at random, is that the screens go blank and the log shows the call trace I posted in the first message. But the system never fully crashed so far. I can usually ssh into the box. Sometimes, a restart of the desktop manager solves the issue, but more often a reboot is necessary.
Comment 6 dwagner 2018-10-08 22:02:56 UTC
@ Vik-T: Thanks for testing, so at least we know it's a different bug that haunts your system.
Comment 7 Matthew Vaughn 2018-10-14 14:16:07 UTC
I am able to reproduce this bug report in every detail on my machine. The only difference is that I am never present to directly observe the driver deadlock; it always occurs when I have left the machine idle for at least a few hours.

Both tests dwagner proposed yielded negative results.

I am attaching dmesg logs from the most recent instance of the problem.

Please advise. I run Gentoo, and am able to easily introduce patches into any part of the system for testing.
Comment 8 Matthew Vaughn 2018-10-14 14:16:52 UTC
Created attachment 142018 [details]
Trimmed dmesg logs
Comment 9 Matthew Vaughn 2018-10-14 17:26:29 UTC
Created attachment 142022 [details]
Full dmesg logs
Comment 10 Matthew Vaughn 2018-10-17 22:35:21 UTC
I've determined that the deadlock and stack trace found in my dmesg logs are emitted precisely when I attempt to wake the machine's display from sleep by touching the keyboard or mouse, and not before.

If I leave the machine on a terminal console instead of in a running X session, the display never sleeps, and the deadlock never occurs.

The first instance of the deadlock on my machine occurred during a session following an upgrade of the xf86-video-amdgpu drivers from version 18.0.1 to 18.1.0, and simultaneously, of Mesa to a checkout of the master branch ca. commit 0d495bec25bd7584de4e988c2b4528c1996bc1d0, or approximately 2018-09-26 04:16 UTC. I am attempting now to revert both of these upgrades one at a time in order to determine whether either of them is implicated.
Comment 11 Matthew Vaughn 2018-10-19 00:51:26 UTC
Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring on my system.
Comment 12 Michel Dänzer 2018-10-19 07:23:44 UTC
(In reply to Matthew Vaughn from comment #11)
> Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to
> x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring
> on my system.

Can you bisect xf86-video-amdgpu?
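
For anyone unfamiliar with bisecting, a minimal sketch of how it could be started in a git checkout of the driver. The helper name start_bisect is made up, and the tag names in the usage lines are assumed rather than checked; `git tag` in the checkout lists the real ones:

```shell
# Start a git bisect between a known-good and a known-bad revision.
# start_bisect is a hypothetical helper name.
start_bisect() {
    repo=$1 good=$2 bad=$3
    git -C "$repo" bisect start "$bad" "$good"
}

# Usage (tag names are assumptions):
#   start_bisect ~/src/xf86-video-amdgpu xf86-video-amdgpu-18.0.1 xf86-video-amdgpu-18.1.0
#   ...build and install the checked-out revision, restart X, test display wake-up...
#   git -C ~/src/xf86-video-amdgpu bisect good    # or: bisect bad
#   git -C ~/src/xf86-video-amdgpu bisect reset   # when finished
```

Since the problem is intermittent, each step needs a long enough observation period before a revision is marked good; a wrong "good" sends the bisect in the wrong direction.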
Comment 13 Matthew Vaughn 2018-10-19 16:07:21 UTC
(In reply to Matthew Vaughn from comment #11)
> Downgrading from x11-drivers/xf86-video-amdgpu-18.1.0 to
> x11-drivers/xf86-video-amdgpu-18.0.1 has prevented the issue from occurring
> on my system.

Premature. Does not work consistently.
Comment 14 Matthew Miller 2018-11-05 20:12:51 UTC
Same thing on Fedora 29. 4.18.16-300.fc29.x86_64; also RX Vega 64. One monitor, connected either via DP or MDP.
Comment 15 Vik-T 2018-11-06 18:56:12 UTC
Someone might find this helpful: I managed to reduce the number of driver crashes considerably by disabling "suspend" and "off" mode in X. 

Section "ServerFlags"
	Option "SuspendTime" "0"
	Option "OffTime" "0"
EndSection

In the last 5-6 days, the driver crashed only once and I managed to bring X back without rebooting. For me, that's a huge improvement over the situation before where I had to reboot 2-3 times a day.
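
The same effect can be tried without editing xorg.conf by setting the DPMS timeouts on the running session with xset. A minimal sketch; the helper name disable_dpms_timeouts is made up, and unlike the config above this also zeroes the standby timeout and does not persist across X restarts:

```shell
# Disable the DPMS standby, suspend and off timeouts for the running X
# session. disable_dpms_timeouts is a hypothetical helper name.
disable_dpms_timeouts() {
    # xset dpms takes three timeouts in seconds (standby, suspend, off);
    # a value of 0 disables that mode.
    xset dpms 0 0 0
}

# Usage (inside the X session, or with DISPLAY set):
#   disable_dpms_timeouts
#   xset q    # the DPMS section shows the resulting timeouts
```

Screen saver blanking is controlled separately (xset s), so the screen may still blank without entering a DPMS power-saving mode.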
Comment 16 Martin Peres 2019-11-19 08:55:26 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/525.
