Bug 100755

Summary: [HSW] linux next-20170407+ boots to a blank screen
Product: DRI Reporter: Rafael Ristovski <rafael.ristovski>
Component: DRM/IntelAssignee: Clinton Taylor <clinton.a.taylor>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: dhinakaran.pandiyan, intel-gfx-bugs, jani.nikula, rafael.ristovski
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard: ReadyForDev
i915 platform: HSW i915 features: display/Other
Attachments:
Description Flags
next-20170324 with fastboot=1, shows suspend/resume, works fine
none
next-20170704 with fastboot=1, shows suspend/resume, display turns off after resume
none
test_revert
none
scale down only link m/n
none
Dmesg of working version
none
Dmesg of broken version (after patches)
none
next-20170407 with only round closest patch none

Description Rafael Ristovski 2017-04-21 19:09:42 UTC
Created attachment 130972 [details]
next-20170324 with fastboot=1, shows suspend/resume, works fine

Ever since next-20170407 (last good next-20170324, versions in-between not tested) my laptops screen stopped turning on after resuming. 
Upon removing i915.fastboot=1 from cmdline, the display now turns off immediately after boot, which means enabling fastboot only covered up the real issue.

HW:
i7-4500U Haswell with HD 4400
Comment 1 Rafael Ristovski 2017-04-21 19:10:43 UTC
Created attachment 130973 [details]
next-20170704 with fastboot=1, shows suspend/resume, display turns off after resume
Comment 2 Dhinakaran Pandiyan 2017-04-22 02:08:17 UTC
The link m_n and data m_n values are different before and after suspend/resume. My hunch is when you booted with fastboot=1, commit[1] did not reduce data m_n but did so after suspend/resume.

Before:
[drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64

After:
[drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 302776, gmch_n: 1048576, link_m: 33641, link_n: 131072, tu: 64


Try reverting this - 

commit 9a86cda07af2c63649932f0a4fc757701ef54c42
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Mon Mar 27 14:33:25 2017 +0300

    drm/i915/dp: reduce link M/N parameters

    Several major vendor USB-C->HDMI converters, in particular the DA200,
    fail to recover a 5.4 GHz 1 lane signal if the link N is greater than
    0x80000.
Comment 3 Dhinakaran Pandiyan 2017-04-22 07:15:11 UTC
Created attachment 130979 [details] [review]
test_revert
Comment 4 Dhinakaran Pandiyan 2017-04-22 07:20:38 UTC
(In reply to Rafael Ristovski from comment #0)
> Created attachment 130972 [details]
> next-20170324 with fastboot=1, shows suspend/resume, works fine
> 
> Ever since next-20170407 (last good next-20170324, versions in-between not
> tested) my laptops screen stopped turning on after resuming. 
> Upon removing i915.fastboot=1 from cmdline, the display now turns off
> immediately after boot, which means enabling fastboot only covered up the
> real issue.
> 
> HW:
> i7-4500U Haswell with HD 4400

I think you are right, fastboot adjusts the m_n values, which might explain why it worked with fastboot=1. I have attached the revert for you test. Please let me know if that works.
Comment 5 Rafael Ristovski 2017-04-22 11:39:40 UTC
I can confirm that reverting 9a86cda07af2c63649932f0a4fc757701ef54c42 fixes the issue.
Both booting with fastboot off and suspend/resume work as intended.

Feel free to send me any future patches to test, although this seems to be some weird edge-case scenario with my laptop. Unless of course I'm the only one using -next on a Haswell i7 :)
Comment 6 Maarten Lankhorst 2017-04-22 17:17:33 UTC
Curious thing..

>>> 2428257. / 8388608
0.28947

>>>  302776. / 1048576
0.28874

>>> 2428257. / 8
303532.125

So this might be some rounding error?

same for link_m/n, expected 33725, got 33641.
Comment 7 Dhinakaran Pandiyan 2017-04-22 23:24:09 UTC
(In reply to Maarten Lankhorst from comment #6)
> Curious thing..
> 
> >>> 2428257. / 8388608
> 0.28947
> 
> >>>  302776. / 1048576
> 0.28874
> 
> >>> 2428257. / 8
> 303532.125
> 
> So this might be some rounding error?
> 
> same for link_m/n, expected 33725, got 33641.

33641/131072=0.2567 is closer to the original pixel_clock/link_clock ratio (69300/270000=0.2567) than 33725/131072=0.2573.

Also, in the attachment (next-20170324 with fastboot=1, shows suspend/resume, works fine) bios programs link_m: 134903 whereas driver programs link_m: 134567. Because both work fine, I guess the problem  is elsewhere.

I wonder if we should not be scaling down data m/n at all. The patch scales both data and link m/n to workaround a dongle issue, we could probably just scale down only link m/n since data m/n is for internal use.
Comment 8 Dhinakaran Pandiyan 2017-04-23 00:38:47 UTC
Created attachment 130986 [details] [review]
scale down only link m/n

Can you please test this?
Comment 9 Rafael Ristovski 2017-04-23 09:51:03 UTC
Sadly patch is a no-go. Same issue as before :/
Comment 10 Dhinakaran Pandiyan 2017-04-24 04:08:44 UTC
I suppose it's the sink that's failing to recover the stream with smaller m/n values then. I'll send the revert to the mailing list for now.
Comment 11 Jani Nikula 2017-04-24 12:53:50 UTC
For extra lols, the BIOS sets up a pixel clock that is different from the mode specified in the VBT.

Rafael, please remove any fastboot parameter, and provide dmesgs from boot, with timestamps and everything, for both 1) drm-next or whatever your baseline was, 2) same with 9a86cda07af2 ("drm/i915/dp: reduce link M/N parameters") reverted.
Comment 12 Jani Nikula 2017-04-24 13:00:19 UTC
The DP CTS defines this as the margin of error for M/N:

| (Mvid / Nvid) * Ls - Fp | <= 0.006 * Fp

Where Ls is the measured average link rate and Fp is the required pixel rate for the mode.

So let's first ignore the fastboot=1 case, and any modes set up by the BIOS, to get an apples to apples comparison before/after the regressing commit.
Comment 13 Jani Nikula 2017-04-26 14:55:18 UTC
Rafael, please also try

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 85b9e2f521a0..6d608e3df597 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6288,7 +6288,7 @@ static void compute_m_n(unsigned int m, unsigned int n,
 	}
 
 	*ret_n = min_t(unsigned int, roundup_pow_of_two(n), DATA_LINK_N_MAX);
-	*ret_m = div_u64((uint64_t) m * *ret_n, n);
+	*ret_m = DIV_ROUND_CLOSEST_ULL((uint64_t) m * *ret_n, n);
 	intel_reduce_m_n_ratio(ret_m, ret_n);
 }
Comment 14 Rafael Ristovski 2017-04-26 16:16:04 UTC
(In reply to Jani Nikula from comment #13)
> Rafael, please also try
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c
> b/drivers/gpu/drm/i915/intel_display.c
> index 85b9e2f521a0..6d608e3df597 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -6288,7 +6288,7 @@ static void compute_m_n(unsigned int m, unsigned int n,
>  	}
>  
>  	*ret_n = min_t(unsigned int, roundup_pow_of_two(n), DATA_LINK_N_MAX);
> -	*ret_m = div_u64((uint64_t) m * *ret_n, n);
> +	*ret_m = DIV_ROUND_CLOSEST_ULL((uint64_t) m * *ret_n, n);
>  	intel_reduce_m_n_ratio(ret_m, ret_n);
>  }

Again patch is a no-go, attaching logs of working vs bad+patched now.
Comment 15 Rafael Ristovski 2017-04-26 16:16:53 UTC
Created attachment 131053 [details]
Dmesg of working version
Comment 16 Rafael Ristovski 2017-04-26 16:17:18 UTC
Created attachment 131054 [details]
Dmesg of broken version (after patches)
Comment 17 Rafael Ristovski 2017-04-26 16:18:58 UTC
You can see a clear difference:

Good:
[1.695231] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64

[1.710939] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2422210, gmch_n: 8388608, link_m: 134567, link_n: 524288, tu: 64


Bad:
[1.692824] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64

[1.708518] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 302776, gmch_n: 1048576, link_m: 33642, link_n: 131072, tu: 64
Comment 18 Jani Nikula 2017-04-27 07:35:35 UTC
(In reply to Jani Nikula from comment #13)
> Rafael, please also try
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c
> b/drivers/gpu/drm/i915/intel_display.c
> index 85b9e2f521a0..6d608e3df597 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -6288,7 +6288,7 @@ static void compute_m_n(unsigned int m, unsigned int n,
>  	}
>  
>  	*ret_n = min_t(unsigned int, roundup_pow_of_two(n), DATA_LINK_N_MAX);
> -	*ret_m = div_u64((uint64_t) m * *ret_n, n);
> +	*ret_m = DIV_ROUND_CLOSEST_ULL((uint64_t) m * *ret_n, n);
>  	intel_reduce_m_n_ratio(ret_m, ret_n);
>  }

I meant, try this *alone* on top of drm-next, without other patches.
Comment 19 Jani Nikula 2017-04-27 07:57:36 UTC
(In reply to Rafael Ristovski from comment #17)
> You can see a clear difference:

Yeah. The link M and N values are used to regenerate the pixel clock from the link clock on the sink side. Regenrated pixel clock = M / N * link clock. They are 24 bit values, and per the DP spec the source can choose them freely, as long as they are within the error boundaries described in comment #12.

Except now we have sinks that fail on some values, and some other sinks that fail on some other values. I am not amused.

Pixel clock 69300:

> Good:
> [1.710939] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2422210,
> gmch_n: 8388608, link_m: 134567, link_n: 524288, tu: 64

Regenerated pixel clock 69299.869537354. We don't know whether the sink rounds that to 69299 or 69300.

> Bad:
> [1.708518] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 302776,
> gmch_n: 1048576, link_m: 33642, link_n: 131072, tu: 64

Regenerated pixel clock with the rounding patch 69300.384521484 and without the rounding patch 69298.324584961. They should be rounded to 69300 and 69298, respectively.

I don't see a whole lot of difference here. Except that the sinks don't know how to count.
Comment 20 Dhinakaran Pandiyan 2017-04-27 08:10:54 UTC
The rounding should not change anything. In the attachment (next-20170324 with fastboot=1, shows suspend/resume, works fine) bios programs link_m: 134903 whereas driver programs link_m: 134567  and both work fine. This shows the sink is able to recover even with small changes in the values. I think, it's the bigger change in the down scaled values of link_m and link_n that seems to have caused the problem.
Comment 21 Jani Nikula 2017-04-27 08:30:37 UTC
(In reply to Dhinakaran Pandiyan from comment #20)
> The rounding should not change anything. In the attachment (next-20170324
> with fastboot=1, shows suspend/resume, works fine) bios programs link_m:
> 134903 whereas driver programs link_m: 134567  and both work fine. This
> shows the sink is able to recover even with small changes in the values. I
> think, it's the bigger change in the down scaled values of link_m and link_n
> that seems to have caused the problem.

No, this is a false conclusion, as the BIOS uses a different pixel clock.
Comment 22 Rafael Ristovski 2017-04-27 14:31:14 UTC
Created attachment 131098 [details]
next-20170407 with only round closest patch

[    1.694577] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64

[    1.710255] [drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 302776, gmch_n: 1048576, link_m: 33642, link_n: 131072, tu: 64
Comment 23 Rafael Ristovski 2017-04-27 14:31:59 UTC
Patch is still a no-go, screen doesn't recover and remains black.
Comment 24 Dhinakaran Pandiyan 2017-04-28 01:55:13 UTC
(In reply to Jani Nikula from comment #21)
> (In reply to Dhinakaran Pandiyan from comment #20)
> > The rounding should not change anything. In the attachment (next-20170324
> > with fastboot=1, shows suspend/resume, works fine) bios programs link_m:
> > 134903 whereas driver programs link_m: 134567  and both work fine. This
> > shows the sink is able to recover even with small changes in the values. I
> > think, it's the bigger change in the down scaled values of link_m and link_n
> > that seems to have caused the problem.
> 
> No, this is a false conclusion, as the BIOS uses a different pixel clock.
You are right, but if you look at the link_m and pixel_rate values for the first kernel modeset:
 
BIOS
[drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64
[drm:intel_dump_pipe_config] port clock: 270000, pipe src size: 720x400, pixel rate 69472

Kernel - before suspend-resume
[drm:intel_dp_compute_config] DP link computation with max lane count 2 max bw 270000 pixel clock 69300KHz
[drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2428257, gmch_n: 8388608, link_m: 134903, link_n: 524288, tu: 64
-> Regenerated pixel clock - 69472.904205322


Kernel - after suspend-resume
[drm:intel_dp_compute_config] DP link computation with max lane count 2 max bw 270000 pixel clock 69300KHz
[drm:intel_dump_pipe_config] dp m_n: lanes: 2; gmch_m: 2422210, gmch_n: 8388608, link_m: 134567, link_n: 524288, tu: 64
-> Regenerated pixel clock - 69299.869537354

It does look like the sink can tolerate small differences in link_m for the same pixel clock.
Comment 25 Jani Nikula 2017-04-28 08:47:08 UTC
(In reply to Dhinakaran Pandiyan from comment #24)
> It does look like the sink can tolerate small differences in link_m for the
> same pixel clock.

Agreed, and it should. See comment #12.

It's just that the sinks screw this up based on the absolute values of M and N, not based on M/N or M/N*Ls.
Comment 26 Dhinakaran Pandiyan 2017-04-28 20:02:45 UTC
Adding laptop and panel information from Rafael.
Laptop: Dell Inspiron 3537 with a i7-4500U Haswell and HD 4400
Panel: Manufacturer: AUO Model 45ec Serial Number 0
Comment 27 Ricardo 2017-05-09 19:35:56 UTC
Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

moving status to REOPENED since there is no pending information needed
Comment 28 Clinton Taylor 2017-05-10 21:40:19 UTC
Upstreamed a patch that detects the Analogix 7737 DP->HDMI converter chip inside the DA200 dongle. If the chip is detected the M and N optimization is executed, if not the M and N values are not changed. This should prevent this issue from happening on this or any other panel/monitor.

https://patchwork.freedesktop.org/patch/155478/
Comment 29 Rafael Ristovski 2017-05-11 14:44:25 UTC
(In reply to Clinton Taylor from comment #28)
> Upstreamed a patch that detects the Analogix 7737 DP->HDMI converter chip
> inside the DA200 dongle. If the chip is detected the M and N optimization is
> executed, if not the M and N values are not changed. This should prevent
> this issue from happening on this or any other panel/monitor.
> 
> https://patchwork.freedesktop.org/patch/155478/

After some thorough testing, I can firmly say that the patch works.
I've tested both normal boot and suspend and both work fine. 
Only remark is that it seems kind of awkward having a check/workaround for an issue caused by a single dongle. But I guess this is the best way to do it.
Comment 30 Dhinakaran Pandiyan 2017-05-11 16:57:19 UTC
New series from Jani which includes a rebased version of Clint's fix https://patchwork.freedesktop.org/series/24282/
Comment 31 Elizabeth 2017-06-21 16:27:24 UTC
(In reply to Dhinakaran Pandiyan from comment #30)
> New series from Jani which includes a rebased version of Clint's fix
> https://patchwork.freedesktop.org/series/24282/

Good afternoon,
Do you have any update on this case? Has the new patch been tested?
Thank you.
Comment 32 Clinton Taylor 2017-06-21 16:57:25 UTC
Patch has been tested and merged into drm-tip. This issue should now be closed. Please reopen if you have further issues.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.