Bug 105267

Summary: Screen flickering after getting hpd irq
Product: DRI
Reporter: Ethan Hsieh <ethan.hsieh>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: intel-gfx-bugs, manasi.d.navare, perry_yuan, rodrigo.vivi, sheirys2, tjaalton
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: CFL
i915 features: display/eDP
Attachments:
Description | Flags
kern.log (drm.debug=0x14) | none
photo - screen_goes_blurry | none
video - screen goes blurry | none
kern.log (drm.debug=0x14) | none
i915_vbt.log | none
hexdump.log | none
kern.log (drm.debug=0x14) | none
kern.log (drm.debug=0xe) | none
with drm.debug=0xe | none
screen flickering | none

Description Ethan Hsieh 2018-02-27 07:20:33 UTC
Created attachment 137634
kern.log (drm.debug=0x14)

The issue occurs on the laptop's internal panel (eDP), not on an external monitor.

kern.log: (Please check full log as attached)
Feb 27 14:01:33 u-Precision-M5530 kernel:
[71.922052] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
[71.924280] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
[71.945713] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short

Kernel: drm-intel-nightly (4.16.0-994).
http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2018-02-27/

$ lspci -nn
Graphic: VGA compatible controller [0300]: Intel Corporation Device [8086:3e9b] (prog-if 00 [VGA controller])
3D controller [0302]: NVIDIA Corporation Device [10de:1cbb] (rev a1)
Comment 1 Ethan Hsieh 2018-02-27 07:24:34 UTC
Created attachment 137635
photo - screen_goes_blurry
Comment 2 Ethan Hsieh 2018-02-27 07:25:02 UTC
Created attachment 137636
video - screen goes blurry
Comment 3 Jani Nikula 2018-03-05 16:11:02 UTC
The video looks like a flicker more than being blurry. To me blurry is static while flicker is constantly changing. Would you describe it as a flicker?

Is this the full dmesg? There's information missing.

Does the screen recover if you switch the display off and back on?

Please try without nvidia module loaded.
Comment 4 Ethan Hsieh 2018-03-06 07:47:34 UTC
> The video looks like a flicker more than being blurry. To me blurry is static while flicker is constantly changing. Would you describe it as a flicker?
Sure. I'm not a graphics expert.

> Is this the full dmesg? There's information missing.
Yes. It's the full dmesg.

> Does the screen recover if you switch the display off and back on?
The screen recovers after suspend/resume (the display turns off and back on).

> Please try without nvidia module loaded.
I can still reproduce the issue after removing the nvidia 390.25 driver.
http://www.nvidia.com/Download/driverResults.aspx/130646/en-us
Comment 5 Jani Nikula 2018-03-08 09:43:08 UTC
Seems to be a Coffeelake with a Cannonlake PCH. (Mysteriously the device info is missing from the logs.)

I have no clue about the root cause yet, but what happens is that the eDP panel signals short pulse hotplug. This indicates the panel requests a link status check. We do this, and find the link status is in fact not good. To remedy, we try to re-train the link. The link training succeeds. This is all according to DP spec. From the logs, there is nothing out of the ordinary. But apparently the link retraining leads to flickering.
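The link status check mentioned above can be illustrated with a short sketch. It is not the i915 code: it only assumes the standard DPCD layout, in which the bytes at 0x202 and 0x203 hold one status nibble per lane (CR_DONE, CHANNEL_EQ_DONE, SYMBOL_LOCKED) and bit 0 of 0x204 is INTERLANE_ALIGN_DONE. The function name `link_is_good` and the sample byte values are made up for illustration.

```python
# Hypothetical helper illustrating a DP link status check; not i915 code.
CR_DONE = 0x1          # clock recovery done
CHANNEL_EQ_DONE = 0x2  # channel equalization done
SYMBOL_LOCKED = 0x4    # symbol lock achieved
LANE_OK = CR_DONE | CHANNEL_EQ_DONE | SYMBOL_LOCKED
INTERLANE_ALIGN_DONE = 0x1


def link_is_good(lane0_1: int, lane2_3: int, align: int, lane_count: int) -> bool:
    """Return True if every active lane reports CR, EQ and symbol lock.

    lane0_1, lane2_3 and align are the bytes at DPCD 0x202, 0x203 and 0x204.
    """
    if not (align & INTERLANE_ALIGN_DONE):
        return False
    status = lane0_1 | (lane2_3 << 8)  # one 4-bit nibble per lane
    for lane in range(lane_count):
        nibble = (status >> (4 * lane)) & 0xF
        if (nibble & LANE_OK) != LANE_OK:
            return False
    return True


# Made-up sample values: a healthy 4-lane link, then lane 1 losing symbol lock.
print(link_is_good(0x77, 0x77, 0x01, 4))
print(link_is_good(0x37, 0x77, 0x01, 4))
```

When a check like this fails after a short pulse, the source side re-trains the link, which is the sequence described in this comment.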

The hotplug occurs at about 71 seconds into the boot. If this is typical, this should give you enough time to do a modeset (disable/enable display, but don't suspend/resume) before this happens due to the hotplug. Does this cause flicker? Or only when the panel requests link status check?

The simple thing (for us) to do is double check the Coffeelake DDI buffer translations and link training values, especially see if there have been any recent updates to the specs.

Curiously the VBT refers to Skylake. I'm wondering where the VBT comes from and whether it's been updated for the platform at hand.
Comment 6 Jani Nikula 2018-03-08 09:46:48 UTC
My first instinct is that the speculation (outside the bug report) about this being an eDP 1.4 related thing is a red herring.

That said, there haven't been all that many eDP 1.4 panels around yet, and it's of course possible we may have overlooked a DP spec change wrt link (re)training that's specific to eDP 1.4.
Comment 7 Perry Yuan 2018-03-09 10:39:45 UTC
(In reply to Jani Nikula from comment #6)
> My first instinct is that the speculation (outside the bug report) about
> this being an eDP 1.4 related thing is a red herring.
> 
> That said, there haven't been all that many eDP 1.4 panels around yet, and
> it's of course possible we may have overlooked a DP spec change wrt link
> (re)training that's specific to eDP 1.4.

Hi Nikula,
Based on the isolation from testing, the panel issue only happens on eDP 1.4 panels. It cannot be reproduced with an eDP 1.3 panel.

So I think we need to check whether there is something in the eDP 1.4 protocol and the i915 driver that needs to be fixed.

Thanks.

Perry
Comment 8 Jani Nikula 2018-03-09 12:10:21 UTC
(In reply to Perry Yuan from comment #7)
> Based on the isolation from testing, the panel issue only happens on eDP 1.4
> panels. It cannot be reproduced with an eDP 1.3 panel.

Do you get short hotplug pulses on the eDP 1.3 panel? Does that lead to link retraining? If not, then it's inconclusive.
Comment 9 Ethan Hsieh 2018-03-12 02:36:03 UTC
Hi Nikula,
I tried to reproduce the issue on a laptop with an eDP 1.3 panel. I could not reproduce the issue and did not get short hotplug pulses.
Comment 10 Jani Nikula 2018-03-13 08:29:17 UTC
Shot in the dark, please try [1]. We can't apply that as-is, but it's a data point.

[1] http://patchwork.freedesktop.org/patch/msgid/1520579339-14745-1-git-send-email-manasi.d.navare@intel.com
Comment 11 Ethan Hsieh 2018-03-14 10:40:53 UTC
Created attachment 138093
kern.log (drm.debug=0x14)

Hi Nikula,

The issue is gone after applying the patch. However, the issue is not easy to reproduce; sometimes it takes more than 1 hour. So I'll do more tests to confirm it.

Here is the test result:
1. With patch in [1]: Pass (0/7)
2. Without patch in [1]: Fail (4/6)

Here are the reproduction steps:
1. Run glxgears for 30mins with patched kernel
2. Check if screen is flickering or not
3. Reboot
4. Run glxgears for 30mins
5. Check if screen is flickering or not
6. Reboot
7. Go to 1.

Please check log as attached.
Comment 12 Ethan Hsieh 2018-03-14 10:48:24 UTC
Hi Nikula,

All logs are around 5GB. So, I only uploaded log12&13 (attached file in comment#11).

I always get the following kernel message in the failing cases:
$ grep -r -e "got hpd" .
./08_fail/kern.log:kernel:[ 621.812487][drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
./06_fail/kern.log:kernel:[2428.068094][drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
./12_fail/kern.log:kernel:[1318.258726][drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
./10_fail/kern.log:kernel:[ 723.448265][drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
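
For larger log sets, the same search can be scripted so that the timestamp, port and pulse type are extracted per hit. This is a minimal sketch; the regular expression is an assumption based only on the line format quoted in this report, and `find_hpd_pulses` is a hypothetical helper name.

```python
import re

# Pattern assumed from the i915 debug lines quoted in this bug report.
HPD_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*\[drm:intel_dp_hpd_pulse \[i915\]\] "
    r"got hpd irq on port (?P<port>\w+) - (?P<kind>short|long)"
)


def find_hpd_pulses(lines):
    """Yield (timestamp_seconds, port, kind) for each HPD pulse line."""
    for line in lines:
        m = HPD_RE.search(line)
        if m:
            yield float(m.group("ts")), m.group("port"), m.group("kind")


sample = [
    "kernel:[ 621.812487][drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short",
    "[71.922052] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short",
    "[drm:drm_mode_addfb2 [drm]] [FB:74]",
]
for ts, port, kind in find_hpd_pulses(sample):
    print(f"{ts:.6f}s port {port} {kind} pulse")
```

To scan a real file, pass an open file object instead of the sample list, for example `find_hpd_pulses(open("kern.log", errors="replace"))`.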
Comment 13 Ethan Hsieh 2018-03-15 06:12:51 UTC
Hi Nikula,

The issue seems to be gone after applying the patch.
I ran a stress test (glxgears) for 1 hour three times and could not reproduce the issue.

Here is the test result:
1. With patch in [1] (1 hour): Pass (0/3)
2. Without patch in [1] (30 mins): Fail (2/2)

BTW, when the issue occurs, it can be recovered with the following commands:
$ DISPLAY=:0 xset dpms force off
$ DISPLAY=:0 xset dpms force on
Comment 14 Ethan Hsieh 2018-03-15 09:34:50 UTC
Hi Jani,
The patched kernel passed a 3-hour test. May I know what the next action is?
Comment 15 Jani Nikula 2018-03-21 11:11:06 UTC
Thanks for testing.

The problem with the patch referenced in comment #10 is that it will regress older platforms. We can't do that.

The background is that we have tried to use optimal link parameters, and we have tried to optimize for both fewer lanes with higher rate, and more lanes with lower rate. All of this failed, until we learned that, uh, a certain other OS always used the maximum link parameters reported by the display. Apparently that was the only configuration that the panel/laptop vendors then ended up validating. We switched to using max link rate and lane count, which presumably correspond to the native resolution of the display anyway, and we haven't had issues with that approach until now.

Arguably all the displays should work with all the lane counts and rates they report, but sadly this appears not to be the case. In this bug, the display does not work with the maximum parameters it reports.
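
The trade-off between lane count and link rate comes down to bandwidth arithmetic, sketched below. The sketch assumes the standard per-lane DP rates (1.62, 2.7 and 5.4 Gbit/s) with 8b/10b encoding. The example mode uses the 312.25 MHz pixel clock from the xrandr output posted later in this thread for a different machine, and 24 bits per pixel is an assumption. The helper names are hypothetical.

```python
# Standard DP link rates per lane in Mbit/s (RBR, HBR, HBR2).
LINK_RATES_MBPS = (1620, 2700, 5400)
LANE_COUNTS = (1, 2, 4)


def link_capacity_mbps(rate_mbps: int, lanes: int) -> int:
    """Payload capacity of the link after 8b/10b encoding overhead."""
    return rate_mbps * lanes * 8 // 10


def mode_requirement_mbps(pixel_clock_khz: int, bpp: int) -> int:
    """Bandwidth needed to carry the pixel stream, rounded up."""
    return -(-pixel_clock_khz * bpp // 1000)


need = mode_requirement_mbps(312_250, 24)  # assumed example mode at 24 bpp
for lanes in LANE_COUNTS:
    for rate in LINK_RATES_MBPS:
        cap = link_capacity_mbps(rate, lanes)
        verdict = "ok" if cap >= need else "too slow"
        print(f"{lanes} lane(s) x {rate} Mbit/s -> {cap} Mbit/s ({verdict})")
```

Several combinations can carry the same mode, which is why the driver has a choice to make at all, and why a panel that was only validated at one of them can misbehave at another.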

Apparently nobody has double checked the DDI buffer translation and voltage swing etc. parameters that I suggested in comment #5. :(

eDP 1.4 also adds two somewhat related features: link rate select, to support more intermediate rates between what's available for DP, and DSC, to support stream compression.

1) Please attach /sys/kernel/debug/dri/0/i915_vbt. The VBT is supposed to contain the port specific maximums, but perhaps that's not being used.

2) Please see that you have CONFIG_DRM_DP_AUX_CHARDEV=y, and try to use /dev/drm_dp_auxN node to hexdump the DPCD, and attach them.
Comment 16 Ethan Hsieh 2018-03-21 11:41:22 UTC
Created attachment 138241
i915_vbt.log

cat /sys/kernel/debug/dri/0/i915_vbt > i915_vbt.log
Comment 17 Ethan Hsieh 2018-03-21 11:42:23 UTC
Created attachment 138242
hexdump.log

Yes. CONFIG_DRM_DP_AUX_CHARDEV=y

hexdump /dev/drm_dp_aux0
hexdump /dev/drm_dp_aux1
hexdump /dev/drm_dp_aux2
hexdump /dev/drm_dp_aux3

Please refer to attached file.
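
A full hexdump of each aux node produces a lot of output. As a rough sketch, the two DPCD regions most relevant here (receiver capabilities at 0x000 and link status at 0x200) can be read selectively. This assumes the aux character device supports seeking to a DPCD address, that the script runs as root, and that the eDP panel is on `/dev/drm_dp_aux0`, which may differ per machine. `read_dpcd` and `hexdump_line` are hypothetical helper names.

```python
import os


def read_dpcd(path: str, offset: int = 0, length: int = 16) -> bytes:
    """Read `length` bytes of DPCD starting at `offset` from an aux device node."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


def hexdump_line(offset: int, data: bytes) -> str:
    """Format bytes as a single 'offset: xx xx ...' line."""
    return f"{offset:05x}: " + " ".join(f"{b:02x}" for b in data)


if __name__ == "__main__":
    # Needs root and CONFIG_DRM_DP_AUX_CHARDEV=y; aux0 is an assumption.
    node = "/dev/drm_dp_aux0"
    if os.path.exists(node):
        print(hexdump_line(0x000, read_dpcd(node, 0x000, 16)))  # receiver caps
        print(hexdump_line(0x200, read_dpcd(node, 0x200, 8)))   # link status
```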
Comment 18 Jani Saarinen 2018-03-29 07:11:41 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Comment 19 Perry Yuan 2018-04-02 08:14:20 UTC
(In reply to Jani Nikula from comment #15)
> Thanks for testing.
> 
> The problem with the patch referenced in comment #10 is that it will regress
> older platforms. We can't do that.
> 
> The background is that we have tried to use optimal link parameters, and we
> have tried to optimize for both fewer lanes with higher rate, and more lanes
> with lower rate. All of this failed, until we learned that, uh, a certain
> other OS always used the maximum link parameters reported by the display.
> Apparently that was the only configuration that the panel/laptop vendors
> then ended up validating. We switched to using max link rate and lane count,
> which presumably correspond to the native resolution of the display anyway,
> and we haven't had issues with that approach until now.
> 
> Arguably all the displays should work with all the lane counts and rates
> they report, but sadly this appears not to be the case. In this bug, the
> display does not work with the maximum parameters it reports.
> 
> Apparently nobody has double checked the DDI buffer translation and voltage
> swing etc. parameters that I suggested in comment #5. :(
> 
> eDP 1.4 also adds two somewhat related features: link rate select, to
> support more intermediate rates between what's available for DP, and DSC, to
> support stream compression.
> 
> 1) Please attach /sys/kernel/debug/dri/0/i915_vbt. The VBT is supposed to
> contain the port specific maximums, but perhaps that's not being used.
> 
> 2) Please see that you have CONFIG_DRM_DP_AUX_CHARDEV=y, and try to use
> /dev/drm_dp_auxN node to hexdump the DPCD, and attach them.

Hi Jani,
If the patch causes a regression, what can we do to fix the issue?


Perry
Comment 20 Jani Nikula 2018-04-04 14:12:24 UTC
Please try this patch to debug. Is the issue reproducible with this? Either way, please attach dmesg running this, with drm.debug=14 module parameter set.

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index 62f82c4298ac..78ee270fefc3 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -1806,7 +1806,7 @@ intel_dp_compute_config(struct intel_encoder *encoder,
                 * configuration, and typically these values correspond to the
                 * native resolution of the panel.
                 */
-               min_lane_count = max_lane_count;
+               min_lane_count = max_lane_count = 2;
                min_clock = max_clock;
        }
Comment 21 Jani Nikula 2018-04-04 14:13:58 UTC
For Rodrigo, Manasi, et al: the purpose in comment #20 is to ensure we can indeed reach the highest clock.
Comment 22 Ethan Hsieh 2018-04-10 07:11:42 UTC
Created attachment 138719
kern.log (drm.debug=0x14)

Hi Jani,

With the patch in comment #20, the screen goes black after booting the kernel.
Comment 23 Jani Saarinen 2018-04-24 06:57:27 UTC
Jani, any advice to progress here?
Comment 24 Jani Nikula 2018-05-09 06:13:36 UTC
Highest priority for consideration
Comment 25 Jani Nikula 2018-05-09 07:21:41 UTC
Please try drm-tip branch of [1]. I presume that will still fail, but I've been wrong before, so let's make sure.

After that, please try patch [2] on top of drm-tip. I presume this will fix the issue. But let's make sure. ;)

If all this helps, we'll still need to figure out how to backport this to older kernels as needed. But first things first, let's figure this out on current drm-tip.

[1] https://cgit.freedesktop.org/drm/drm-tip
[2] http://patchwork.freedesktop.org/patch/msgid/20180509071321.28563-1-jani.nikula@intel.com
Comment 26 Ethan Hsieh 2018-05-11 10:41:56 UTC
I cannot reproduce the issue with either [1] or [2].
I ran glxgears for 1 hour.
[1]: Pass (2/2)
[2]: Pass (2/2)
Comment 27 Jani Nikula 2018-05-14 13:16:08 UTC
(In reply to Ethan Hsieh from comment #26)
> I cannot reproduce the issue with either [1] or [2].
> I ran glxgears for 1 hour.
> [1]: Pass (2/2)
> [2]: Pass (2/2)

That's a surprise. I expected [1] to fail and [2] to fix it. It appears we already have something that fixes the issue in drm-tip.
Comment 28 Francesco Balestrieri 2018-05-15 08:23:55 UTC
I'm inclined to mark this resolved if there is no objection.
Comment 29 Jani Nikula 2018-05-16 07:21:38 UTC
Please post the dmesg with drm.debug=14 for running drm-tip. Is this for sure the same configuration that fails on older kernels?
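
Note that this thread uses both drm.debug=14 and drm.debug=0x14, which select different message categories. The sketch below decodes the mask, assuming the DRM_UT_* bit assignments used by kernels of this era (around v4.16); `decode_drm_debug` is a hypothetical helper name.

```python
# Debug category bits, assumed from the kernel's DRM_UT_* definitions (~v4.16).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(value: int) -> list:
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if value & bit]


for text in ("14", "0xe", "0x14"):
    value = int(text, 0)  # base 0 accepts both decimal and 0x-prefixed input
    print(f"drm.debug={text}: {decode_drm_debug(value)}")
```

Under that assumption, 14 (0xe) enables driver, KMS and PRIME messages, while 0x14 enables only KMS and atomic messages. That difference may explain why some driver-level information was missing from the earlier logs.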
Comment 30 Ethan Hsieh 2018-05-17 01:57:04 UTC
Created attachment 139606
kern.log (drm.debug=0xe)

Please refer to the attached file for drm-tip's kernel log.
I always use the same machine and configuration to reproduce the issue.
Comment 31 Jani Nikula 2018-05-17 08:52:34 UTC
(In reply to Ethan Hsieh from comment #30)
> Created attachment 139606
> kern.log (drm.debug=0xe)
> 
> Please refer to the attached file for drm-tip's kernel log.
> I always use the same machine and configuration to reproduce the issue.

Well, for some reason or another, we don't get the hotplug irq from the panel here like we do in the failing case. That's what the panel uses to indicate the link is not good, and sets the failure in motion.
Comment 32 sheirys2@gmail.com 2018-06-20 16:12:42 UTC
Created attachment 140249
with drm.debug=0xe
Comment 33 sheirys2@gmail.com 2018-06-20 16:45:05 UTC
Hello, I think I am also affected by this bug, or a similar one.

The screen flickers constantly at random intervals.
The flickering does not appear while running Windows or in the BIOS.
To describe a "flick": the display, or part of it, becomes black or filled with random pixels for a very short time. Sometimes a "flick" appears more than once per second.

When a "flick" appears, dmesg (with drm.debug=0xe) shows:

...
Jun 19 21:29:04 localhost.localdomain kernel: [drm:drm_mode_addfb2 [drm]] [FB:74]
Jun 19 21:29:04 localhost.localdomain kernel: [drm:gen8_irq_handler [i915]] hotplug event received, stat 0x01000000, dig 0x11101010, pins 0x00000010
Jun 19 21:29:04 localhost.localdomain kernel: [drm:intel_hpd_irq_handler [i915]] digital hpd port A - short
Jun 19 21:29:04 localhost.localdomain kernel: [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port A - short
Jun 19 21:29:04 localhost.localdomain kernel: [drm:intel_dp_read_dpcd [i915]] DPCD: 12 0a 84 41 00 00 01 01 02 00 00 00 00 0b 00
Jun 19 21:29:04 localhost.localdomain kernel: [drm:drm_mode_addfb2 [drm]] [FB:72]
...
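
The DPCD line in the excerpt above carries the panel's receiver capabilities. As a rough sketch, the first three bytes can be decoded as follows, assuming the standard DPCD layout (0x000 revision, 0x001 maximum link rate in units of 0.27 Gbit/s, 0x002 maximum lane count with the enhanced framing flag in bit 7). `decode_receiver_caps` is a hypothetical helper name.

```python
def decode_receiver_caps(dpcd_hex: str) -> dict:
    """Decode the first three DPCD receiver capability bytes (0x000-0x002)."""
    data = bytes.fromhex(dpcd_hex.replace(" ", ""))
    rev, max_rate, max_lanes = data[0], data[1], data[2]
    return {
        "dpcd_rev": f"{rev >> 4}.{rev & 0xF}",
        "max_link_rate_mbps": max_rate * 270,  # field is in units of 0.27 Gbit/s
        "max_lane_count": max_lanes & 0x1F,
        "enhanced_framing": bool(max_lanes & 0x80),
    }


# Bytes copied from the intel_dp_read_dpcd line in the log excerpt.
caps = decode_receiver_caps("12 0a 84 41 00 00 01 01 02 00 00 00 00 0b 00")
print(caps)
```

Read this way, the panel reports DPCD revision 1.2 with a maximum of 4 lanes at 2.7 Gbit/s per lane.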

Arch: x86_64
Kern: 4.16.15-300.fc28.x86_64
Dist: Fedora 28
Machine: Lenovo Yoga 900S-12ISK

xrandr --verbose:
Screen 0: minimum 320 x 200, current 2560 x 1440, maximum 8192 x 8192
XWAYLAND0 connected 2560x1440+0+0 (0x22) normal (normal left inverted right x axis y axis) 280mm x 160mm
	Identifier: 0x21
	Timestamp:  21014
	Subpixel:   unknown
	Gamma:      1.0:1.0:1.0
	Brightness: 0.0
	Clones:    
	CRTC:       0
	CRTCs:      0
	Transform:  1.000000 0.000000 0.000000
	            0.000000 1.000000 0.000000
	            0.000000 0.000000 1.000000
	           filter: 
  2560x1440 (0x22) 312.250MHz -HSync +VSync *current +preferred
        h: width  2560 start 2752 end 3024 total 3488 skew    0 clock  89.52KHz
        v: height 1440 start 1443 end 1448 total 1493           clock  59.96Hz
Comment 34 sheirys2@gmail.com 2018-06-20 16:46:21 UTC
Created attachment 140250
screen flickering
Comment 35 Rodrigo Vivi 2018-06-27 17:22:04 UTC
sheirys2, it seems this got fixed in the latest kernels.
Could you please try with the latest drm-tip?
Comment 36 Lakshmi 2018-08-26 05:26:24 UTC
Reporter, can you check if this issue is still reproducible with latest drm-tip?
Comment 37 Lakshmi 2018-09-10 06:40:24 UTC
Ethan, Ping?
Comment 38 sheirys2@gmail.com 2018-09-10 06:58:55 UTC
Hi, sorry for the late reply.

>> sheirys2, it seems this got fixed in the latest kernels.
>> Could you please try with the latest drm-tip?

Can you explain how to do that?

Also, I do not know if it is related, but I installed Arch Linux with GNOME 3. By default, screen rotation does not work and the screen flickering is gone. After I installed `iio-sensor-proxy`, rotation started working and the screen started to flicker again. So for now the fix for me is to remove the `iio-sensor-proxy` package.
Comment 39 Ethan Hsieh 2018-09-21 09:52:17 UTC
Hi Lakshmi,

I used a DVT1 device to reproduce the issue, and it can be reproduced easily on DVT1. But I only have a DVT2 on hand now, and the failure rate is very low on DVT2. Even though the latest drm-tip passes an 8-hour test, I am not confident that the issue is really fixed by the latest drm-tip.
Comment 40 Jani Nikula 2018-09-28 06:57:34 UTC
I'm assuming this is fixed by

commit 7769db5883841b03de544a35a71ff528d4131c17
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Wed Sep 5 12:53:21 2018 +0300

    drm/i915/dp: optimize eDP 1.4+ link config fast and narrow

that I just pushed.

Please reopen if the problem still persists with that commit or current drm-tip.
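
To illustrate what a "fast and narrow" preference means compared with the alternatives discussed in this thread, here is a simplified sketch of three selection strategies. It is not the i915 implementation, which also considers bpp and platform limits. It reflects one reading of the commit subject: prefer the fewest lanes, then the lowest rate that still carries the mode at that lane count. The rate table, the function names and the example bandwidth requirement are assumptions.

```python
RATES_MBPS = (1620, 2700, 5400)  # RBR, HBR, HBR2 per lane
LANES = (1, 2, 4)


def fits(rate: int, lanes: int, need_mbps: int) -> bool:
    """True if the link's 8b/10b payload capacity covers the requirement."""
    return rate * lanes * 8 // 10 >= need_mbps


def pick_max(need_mbps):
    """Always use the largest configuration (the earlier eDP behaviour)."""
    rate, lanes = RATES_MBPS[-1], LANES[-1]
    return (lanes, rate) if fits(rate, lanes, need_mbps) else None


def pick_narrow_first(need_mbps):
    """Fewest lanes first; within a lane count, lowest rate that fits."""
    for lanes in LANES:
        for rate in RATES_MBPS:
            if fits(rate, lanes, need_mbps):
                return lanes, rate
    return None


def pick_wide_first(need_mbps):
    """Lowest rate first; within a rate, fewest lanes that fit."""
    for rate in RATES_MBPS:
        for lanes in LANES:
            if fits(rate, lanes, need_mbps):
                return lanes, rate
    return None


need = 7494  # Mbit/s, hypothetical mode (312.25 MHz pixel clock at 24 bpp)
print("max:         ", pick_max(need))
print("narrow first:", pick_narrow_first(need))
print("wide first:  ", pick_wide_first(need))
```

For this hypothetical mode the three strategies choose three different (lanes, rate) pairs, which shows how a change in search order alone can move a panel onto a link configuration it handles better or worse.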
Comment 41 Albert Astals Cid 2019-03-06 21:47:02 UTC
I just updated to Linux 5.0 and got a black screen when starting X.

I did a git bisect and found out this patch seems to be the culprit.

I'm running an XPS 15 9570 (several other people seem to be having the same problem).

Is there anything I can do to help diagnose what's wrong?
Comment 42 Albert Astals Cid 2019-03-06 23:40:56 UTC
FWIW, this "fixes" it for me: https://invent.kde.org/snippets/44
Comment 43 Jani Nikula 2019-03-08 13:08:23 UTC
(In reply to Albert Astals Cid from comment #41)
> I just updated to Linux 5.0 and got a black screen when starting X.
> 
> I did a git bisect and found out this patch seems to be the culprit.
> 
> I'm running an XPS 15 9570 (several other people seem to be having the same
> problem).
> 
> Is there anything I can do to help diagnose what's wrong?

Please file a new bug and attach dmesg all the way from boot, with the drm.debug=14 module parameter set.
Comment 44 Albert Astals Cid 2019-03-11 11:07:37 UTC
And Bugzilla decided not to email me, so I didn't see the answer.

OK, so I will recompile the kernel without my workaround and give you that debug log.
Comment 45 Albert Astals Cid 2019-03-11 12:16:36 UTC
New bug at https://bugs.freedesktop.org/show_bug.cgi?id=109959
