Created attachment 116214 [details] Xorg.log

Hi,

My system regularly forgets about the screen connected to its DVI output, with slightly varying symptoms. This used to happen irregularly, but since upgrading to 1.17.1 (as packaged in Debian, 2:1.17.1-2) it's systematic. Even with no power manager (I uninstalled xfce4-power-manager) and no screensaver, the screen ends up going into powersave mode and X never comes back. Sometimes switching to a text VT brings the screen back, and then restarting X works; but often switching to another VT doesn't change anything, nor does restarting X; only rebooting helps then.

I'm attaching Xorg.log, the kernel log, and working and failed watermarks and register dumps. The latter reveals that the DVI output is just gone as far as the system is concerned... I've tried disconnecting and reconnecting the DVI cable, and that doesn't help either.

I've noticed lots of traces like

Jun 1 21:03:54 heffalump kernel: [52160.398608] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 215
Jun 1 21:03:54 heffalump kernel: [52160.398611] Raw EDID:
Jun 1 21:03:54 heffalump kernel: [52160.398612] 00 ff ff ff ff ff ff 00 22 f0 f7 26 01 01 01 01
Jun 1 21:03:54 heffalump kernel: [52160.398613] 25 12 01 03 80 36 23 78 ee ce 50 a3 54 4c 99 26
Jun 1 21:03:54 heffalump kernel: [52160.398614] 0f 5f ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 1 21:03:54 heffalump kernel: [52160.398615] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 1 21:03:54 heffalump kernel: [52160.398616] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 1 21:03:54 heffalump kernel: [52160.398617] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 1 21:03:54 heffalump kernel: [52160.398618] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 1 21:03:54 heffalump kernel: [52160.398619] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

with varying values; I don't know if that's relevant at all.

Regards,

Stephen
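For readers wondering what the "remainder" in these messages means: an EDID block is 128 bytes, and the sum of all its bytes modulo 256 must be 0. The sketch below is only an illustration, with the bytes copied from the dump above and a hypothetical helper name `edid_remainder`; it is not the kernel's code.

```python
# Bytes copied from the "Raw EDID" dump in the report (syslog prefixes removed).
RAW_EDID_DUMP = """
00 ff ff ff ff ff ff 00 22 f0 f7 26 01 01 01 01
25 12 01 03 80 36 23 78 ee ce 50 a3 54 4c 99 26
0f 5f ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
"""


def edid_remainder(block: bytes) -> int:
    """Return the byte sum of one 128-byte EDID block modulo 256.

    Hypothetical helper for illustration; a valid block returns 0.
    """
    if len(block) != 128:
        raise ValueError(f"expected 128 bytes, got {len(block)}")
    return sum(block) % 256


# Turn the whitespace-separated hex pairs into raw bytes.
block = bytes(int(token, 16) for token in RAW_EDID_DUMP.split())

print(edid_remainder(block))  # 215, the same remainder the kernel reported
```

The long run of ff bytes starting partway through the third row suggests the read was cut short rather than that individual bytes were corrupted, which would fit an unreliable transfer from the monitor.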
Created attachment 116215 [details] kern.log
Created attachment 116216 [details] Watermark when the screen is working
Created attachment 116217 [details] Watermark when the screen is gone
Created attachment 116218 [details] Register dump when the screen is working
Created attachment 116219 [details] Register dump when the screen is gone
I forgot to mention, this is on a Haswell Xeon E3-1245v3, on a Supermicro X10SAE board, with a HP LP2475w connected to the motherboard's DVI output.
Your kernel log doesn't have debug information for drm. Please add drm.debug=0xe to your kernel command line, reproduce the problem again and attach the full dmesg here.
It's enough, or rather we do not log anything else that is relevant. Frequent hotplug events are causing EDID reads, which occasionally fail and so render the output disconnected. A workaround would be to save a valid EDID and then force the kernel to use it via the drm_kms_helper.edid_firmware= parameter. But before doing that, I would try replacing the DVI cable.
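Since the monitor sometimes returns a corrupt EDID, it is worth checking a saved copy before handing it to the kernel. The following is a minimal sketch, assuming the EDID has already been saved as a binary file; `check_edid` is a hypothetical helper, and it only checks the fixed 8-byte header, the extension count in byte 126 and the per-block checksum, not the descriptor contents.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])
BLOCK_SIZE = 128


def check_edid(data: bytes) -> list:
    """Return a list of problems found in an EDID blob (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = []
    if len(data) == 0 or len(data) % BLOCK_SIZE != 0:
        problems.append(
            f"length {len(data)} is not a positive multiple of {BLOCK_SIZE}"
        )
        return problems

    # The base block must start with the fixed EDID header pattern.
    if data[:8] != EDID_HEADER:
        problems.append("base block does not start with the EDID header")

    # Byte 126 of the base block holds the number of extension blocks.
    blocks = len(data) // BLOCK_SIZE
    if data[126] != blocks - 1:
        problems.append(
            f"base block declares {data[126]} extension(s), "
            f"file contains {blocks - 1}"
        )

    # Every 128-byte block must sum to 0 modulo 256.
    for index in range(blocks):
        chunk = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
        remainder = sum(chunk) % 256
        if remainder != 0:
            problems.append(f"block {index}: checksum remainder is {remainder}")

    return problems
```

Calling `check_edid(open(path, "rb").read())` on the saved file and getting an empty list back is a reasonable sanity check before using it as the override; the file path is whatever the EDID was saved to.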
Huh, I didn't think of checking the hardware... I've ordered a replacement DVI cable. Thanks for the suggestion! I'll let you know if it fixes things.
I've replaced the DVI cable, but it hasn't fixed things (I even tried with a third DVI cable). I can trigger this reliably by switching inputs on the monitor; this has always logged invalid EDIDs, but only recently has it caused the system to forget it's there. I've noticed that whichever VT is selected when I switch inputs is the one that goes away; if it's a console VT though, all the console VTs go and I haven't found a way to get them back. An X VT won't come back either but I can restart X and that restores the display (most of the time). Should I just go for the stored EDID approach? Or is there some change I could revert somewhere?
The source will be the invalid EDID - we use it to confirm that you have a DVI connection. You can do a bisect to find which commit aggravated the issue for you. It might be that something in the comms protocol, e.g. switching to using GMBUS rather than GPIO, caused more frequent failures, but if the monitor has always returned invalid EDIDs at some point, then I'm not optimistic that it will be that. My guess is that the bisect would report that the offending commit is one that causes us to try less hard to recover a broken EDID. But maybe the bisect would be a surprise and we would find a genuine bug. Whilst I would appreciate a bisect, it may take a few days to perform, or longer depending on how long it takes for the invalid EDID to generate a disconnect. Overriding the EDID would take a couple of minutes to set up, so I can understand if you just did the workaround.
It looks like I'm going to have to bisect this anyway, the EDID workaround isn't enough... This morning the screen was gone again! There's nothing in the kernel logs overnight, but when I touched the keyboard then started switching VTs to try to get the system back:

Jun 10 06:42:35 heffalump kernel: [30650.505896] platform HDMI-A-3: firmware: direct-loading firmware edid/hp-lp2475w.edid
Jun 10 06:42:35 heffalump kernel: [30650.505918] [drm] Got external EDID base block and 0 extensions from "edid/hp-lp2475w.edid" for connector "HDMI-A-3"

repeated 5 times, then

Jun 10 06:42:47 heffalump kernel: [30663.165540] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 3
Jun 10 06:42:47 heffalump kernel: [30663.165543] Raw EDID:
Jun 10 06:42:47 heffalump kernel: [30663.165544] 00 ff ff ff ff ff ff 00 22 f0 f7 26 01 01 01 01
Jun 10 06:42:47 heffalump kernel: [30663.165544] 25 12 01 03 80 36 23 78 ee ce 50 a3 54 4c 99 26
Jun 10 06:42:47 heffalump kernel: [30663.165545] 0f 50 54 a5 6b 80 81 40 a9 00 a9 40 b3 00 d1 00
Jun 10 06:42:47 heffalump kernel: [30663.165546] 01 01 01 01 01 01 28 3c ff ff ff ff ff ff ff ff
Jun 10 06:42:47 heffalump kernel: [30663.165546] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 10 06:42:47 heffalump kernel: [30663.165547] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 10 06:42:47 heffalump kernel: [30663.165547] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Jun 10 06:42:47 heffalump kernel: [30663.165548] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

(that came from the screen, the stored EDID is correct); then 11 "direct-loading firmware" messages, followed by

Jun 10 06:42:54 heffalump kernel: [30669.471054] [drm] HPD interrupt storm detected on connector DP-2: switching from hotplug detection to polling

then another few dozen "direct-loading" messages with 2 invalid checksum messages, and again

Jun 10 06:45:48 heffalump kernel: [30844.198498] [drm] HPD interrupt storm detected on connector DP-2: switching from hotplug detection to polling

before I gave up and rebooted.
Is this still an issue? Have you been able to bisect?
Closing this bug; there has been no response for several months. If the problem persists, please create a new bug and attach logs.