Bug 101167

Summary: [BAT][SKL] igt@kms_flip@basic-flip-vs-dpms produces a dmesg-fail
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Manasi <manasi.d.navare>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: highest CC: intel-gfx-bugs, ricardo.vega
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: SKL i915 features: display/DP
Attachments:
Description Flags
dmesg from when we fail the test
none
dmesg for Manasi suggestion Comment 17 none

Description Martin Peres 2017-05-24 13:59:10 UTC
Starting from CI_DRM_2645, the test igt@kms_flip@basic-flip-vs-dpms sometimes produces a dmesg-fail on fi-skl-6700hq because of the following assertion:

(kms_flip:3948) igt-kms-CRITICAL: Test assertion failure function kmstest_set_connector_dpms, file igt_kms.c:1030:
(kms_flip:3948) igt-kms-CRITICAL: Failed assertion: drmModeConnectorSetProperty(fd, connector->connector_id, dpms, mode) == 0
(kms_flip:3948) igt-kms-CRITICAL: Last errno: 22, Invalid argument

This may be related to having the following errors in the logs:
[  414.849575] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training

Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2645/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-dpms.html
Comment 1 Martin Peres 2017-05-24 14:00:57 UTC
Bug about the link training error: https://bugs.freedesktop.org/show_bug.cgi?id=101144
Comment 2 Marta Löfstedt 2017-05-26 08:35:14 UTC
Note this fails on:
CI_DRM_2645, CI_DRM_2647, CI_DRM_2648, CI_DRM_2649, CI_DRM_2651, CI_DRM_2652, CI_DRM_2653, CI_DRM_2654
It can fail on any of the pipes
Comment 3 Marta Löfstedt 2017-05-26 12:39:36 UTC
I discussed this a bit with Mika Kahola and Martin Peres and we think that we shouldn't try to prune the last mode.

Also, we believe that IGT should skip (kms) tests if we can't get any mode working.
Comment 4 Marta Löfstedt 2017-05-29 10:09:00 UTC
I can't seem to be able to reproduce the issue if I add a msleep(500)
at the beginning of:
static void
intel_dp_check_link_status(struct intel_dp *intel_dp)
Comment 5 Marta Löfstedt 2017-05-29 10:21:37 UTC
This also appear to do the trick:

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index 4a6feb6..ec6de81 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -1106,7 +1106,7 @@ intel_dp_aux_ch(struct intel_dp *intel_dp,
                                                          aux_clock_divider);
 
                /* Must try at least 3 times according to DP spec */
-               for (try = 0; try < 5; try++) {
+               for (try = 0; try < 10; try++) {


But I since the issue hasn't been 100% reproducible I believe we need to find a root cause and not just slowing things down.
Comment 6 Marta Löfstedt 2017-05-30 07:42:02 UTC
If I revert:

"commit 9301397a63b3bf1090dffe846c6f1c8efa032236
Author: Manasi Navare <manasi.d.navare@intel.com>
Date:   Thu Apr 6 16:44:19 2017 +0300

    drm/i915: Implement Link Rate fallback on Link training failure"

I can't reproduce the test failing. However I still get the dmesg-warn (see bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 )
Comment 7 Jani Nikula 2017-05-30 12:53:32 UTC
(In reply to Marta Löfstedt from comment #6)
> If I revert:
> 
> "commit 9301397a63b3bf1090dffe846c6f1c8efa032236
> Author: Manasi Navare <manasi.d.navare@intel.com>
> Date:   Thu Apr 6 16:44:19 2017 +0300
> 
>     drm/i915: Implement Link Rate fallback on Link training failure"
> 
> I can't reproduce the test failing. However I still get the dmesg-warn (see
> bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 )

Before the change, we plunge on with channel equalization regardless of failures during clock recovery phase. By magic, on some machines it'll still all work despite the errors.

Per spec, the correct thing to do is to check and handle the failures gracefully, and reducing the link, which Manasi's commit attempts to do.

(In reply to Marta Löfstedt from comment #5)
> But I since the issue hasn't been 100% reproducible I believe we need to
> find a root cause and not just slowing things down.

Agreed. In this case it seems to boil down to DP aux channel communication. On a whim, please try:

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index fc691b8b317c..fb568f75939f 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -2431,6 +2431,8 @@ void intel_dp_sink_dpms(struct intel_dp *intel_dp, int mode)
 {
 	int ret, i;
 
+	return;
+
 	/* Should have a valid DPCD by this point */
 	if (intel_dp->dpcd[DP_DPCD_REV] < 0x11)
 		return;
Comment 8 Marta Löfstedt 2017-05-31 05:55:17 UTC
(In reply to Jani Nikula from comment #7)
...
> On a whim, please try:
> 
> diff --git a/drivers/gpu/drm/i915/intel_dp.c
> b/drivers/gpu/drm/i915/intel_dp.c
> index fc691b8b317c..fb568f75939f 100644
> --- a/drivers/gpu/drm/i915/intel_dp.c
> +++ b/drivers/gpu/drm/i915/intel_dp.c
> @@ -2431,6 +2431,8 @@ void intel_dp_sink_dpms(struct intel_dp *intel_dp, int
> mode)
>  {
>  	int ret, i;
>  
> +	return;
> +
>  	/* Should have a valid DPCD by this point */
>  	if (intel_dp->dpcd[DP_DPCD_REV] < 0x11)
>  		return;

To my surprise above doesn't work. So, it appears as if it is not dpms related.
Comment 9 Marta Löfstedt 2017-05-31 07:32:58 UTC
I have reproduced the issue with 4.11.0-rc7+

"commit 0109808145180d0ed3da05361f80839e6c80caeb
Author: Colin Ian King <colin.king@canonical.com>
Date:   Fri May 19 18:56:17 2017 +0100

    drm/i915: Check for allocation failure"

This commit is even before we start seeing the dmesg-warn (BUG https://bugs.freedesktop.org/show_bug.cgi?id=101144)
Comment 10 Jani Nikula 2017-06-02 08:03:31 UTC
(In reply to Marta Löfstedt from comment #6)
> If I revert:
> 
> "commit 9301397a63b3bf1090dffe846c6f1c8efa032236
> Author: Manasi Navare <manasi.d.navare@intel.com>
> Date:   Thu Apr 6 16:44:19 2017 +0300
> 
>     drm/i915: Implement Link Rate fallback on Link training failure"
> 
> I can't reproduce the test failing. However I still get the dmesg-warn (see
> bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 )

Even if the underlying problem is there before 9301397a63b3 ("drm/i915: Implement Link Rate fallback on Link training failure"), the commit makes the situation worse. It doesn't matter that it does the right thing according to the specs, the user experience matters. It's a regression. Either it needs to be fixed ASAP or reverted.
Comment 11 Marta Löfstedt 2017-06-05 06:16:54 UTC
Created attachment 131703 [details]
dmesg from when we fail the test

To clarify and distiguish from bug:
https://bugs.freedesktop.org/show_bug.cgi?id=101144

I added dmesg from when we fail the test. This is also available at: https://intel-gfx-ci.01.org/CI/CI_DRM_2645/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-dpms.html

and a lot of the following CI_DRM_NNNN runs have the same issue.
Comment 12 Marta Löfstedt 2017-06-05 08:31:52 UTC
I was wondering why we don't sleep when we are timed out on the aux channel.
So, I decided to test below:

--- a/drivers/gpu/drm/drm_dp_helper.c
+++ b/drivers/gpu/drm/drm_dp_helper.c
@@ -198,7 +198,7 @@ static int drm_dp_dpcd_access(struct drm_dp_aux *aux, u8 request,
         * sufficient, bump to 32 which makes Dell 4k monitors happier.
         */
        for (retry = 0; retry < 32; retry++) {
-               if (ret != 0 && ret != -ETIMEDOUT) {
+               if (ret != 0 /*&& ret != -ETIMEDOUT*/) {

To my surprise this does NOT help. 

Note:
 commit e1083ff35157185b01bc0a99cb19b7cbae0fc9fa
Author: Lyude <cpaul@redhat.com>
Date:   Wed Apr 13 10:58:30 2016 -0400

    drm/dp_helper: Always wait before retrying native aux transactions

There is also a reference to a discussion about a similar issue, but this time with docking station involved for this patch:
commit 82922da39190199260a726d7081a8ea4873e5fd6
Author: Lyude <cpaul@redhat.com>
Date:   Wed Apr 13 10:58:31 2016 -0400

    drm/dp_helper: Retry aux transactions on all errors
Comment 13 Marta Löfstedt 2017-06-05 08:46:51 UTC
tested without psr:
[  130.895130] [drm:intel_psr_enable] PSR disable by flag

but this doesn't make any difference test is still failing and then we end up in a state without any modes.
Comment 14 Marta Löfstedt 2017-06-05 13:36:21 UTC
(In reply to Jani Nikula from comment #10)
...
> 
> Even if the underlying problem is there before 9301397a63b3 ("drm/i915:
> Implement Link Rate fallback on Link training failure"), the commit makes
> the situation worse. It doesn't matter that it does the right thing
> according to the specs, the user experience matters. It's a regression.
> Either it needs to be fixed ASAP or reverted.

Jani, how about a compromise just do the traing if we are not edp?


+++ b/drivers/gpu/drm/i915/intel_dp_link_training.c
@@ -308,9 +308,17 @@ void intel_dp_stop_link_train(struct intel_dp *intel_dp)
                                DP_TRAINING_PATTERN_DISABLE);
 }

 void
 intel_dp_start_link_train(struct intel_dp *intel_dp)
 {  
        struct intel_connector *intel_connector = intel_dp->attached_connector;
 
        if (!intel_dp_link_training_clock_recovery(intel_dp))
@@ -328,10 +336,12 @@ intel_dp_start_link_train(struct intel_dp *intel_dp)
  failure_handling:
        DRM_DEBUG_KMS("Link Training failed at link rate = %d, lane count = %d",
                      intel_dp->link_rate, intel_dp->lane_count);
-       if (!intel_dp_get_link_train_fallback_values(intel_dp,
-                                                    intel_dp->link_rate,
-                                                    intel_dp->lane_count))
-               /* Schedule a Hotplug Uevent to userspace to start modeset */
-               schedule_work(&intel_connector->modeset_retry_work);
+       if (!is_edp(intel_dp)) {
+         if (!intel_dp_get_link_train_fallback_values(intel_dp,
+                                                      intel_dp->link_rate,
+                                                      intel_dp->lane_count))
+           /* Schedule a Hotplug Uevent to userspace to start modeset */
+           schedule_work(&intel_connector->modeset_retry_work);
+       }
Comment 15 Manasi 2017-06-05 18:59:49 UTC
Does the test pass by not doing fallback on eDP?
Comment 16 Manasi 2017-06-06 00:43:12 UTC
Could you actually try this:

In intel_dp_link_training_clock_recovery():

Could you change it to following:

if(!intel_dp_reset_link_train(intel_dp,
                              DP_TRAINING_PATTERN1 |
                              DP_LINK_SCRAMBLING_DISABLE)) {
   DRM_ERROR();
   return true;
}




So just try returning a true there and lets see what it does.

Manasi
Comment 17 Marta Löfstedt 2017-06-06 06:04:16 UTC
(In reply to Manasi from comment #15)
> Does the test pass by not doing fallback on eDP?

We don't fail the test, but we still get the dmesg-warn, see bug 101144
Comment 18 Marta Löfstedt 2017-06-06 06:22:40 UTC
Created attachment 131731 [details]
dmesg for Manasi suggestion Comment 17
Comment 19 Marta Löfstedt 2017-06-06 06:54:11 UTC
(In reply to Manasi from comment #16)
> Could you actually try this:
> 
> In intel_dp_link_training_clock_recovery():
> 
> Could you change it to following:
> 
> if(!intel_dp_reset_link_train(intel_dp,
>                               DP_TRAINING_PATTERN1 |
>                               DP_LINK_SCRAMBLING_DISABLE)) {
>    DRM_ERROR();
>    return true;
> }
> 
> 
> 
> 
> So just try returning a true there and lets see what it does.
> 
> Manasi

Manasi, with above the test fails and then we are left without any modes as with a lot of other stuff I have tested. As far as I have understood eDP only has one mode. If we prune it we have nothing left, this could be solved as in Comment 14. Then we could continue the investigation as to why this specific device started to behave bad on the aux channel in bug 101114.
Comment 20 Marta Löfstedt 2017-06-06 06:55:56 UTC
(In reply to Marta Löfstedt from comment #19)
> (In reply to Manasi from comment #16)
> > Could you actually try this:
> > 
> > In intel_dp_link_training_clock_recovery():
> > 
> > Could you change it to following:
> > 
> > if(!intel_dp_reset_link_train(intel_dp,
> >                               DP_TRAINING_PATTERN1 |
> >                               DP_LINK_SCRAMBLING_DISABLE)) {
> >    DRM_ERROR();
> >    return true;
> > }
> > 
> > 
> > 
> > 
> > So just try returning a true there and lets see what it does.
> > 
> > Manasi
> 
> Manasi, with above the test fails and then we are left without any modes as
> with a lot of other stuff I have tested. As far as I have understood eDP
> only has one mode. If we prune it we have nothing left, this could be solved
> as in Comment 14. Then we could continue the investigation as to why this
> specific device started to behave bad on the aux channel in bug 101114.

Sorry bad reference above should have been bug 101144.
Comment 21 Manasi 2017-06-06 15:23:42 UTC
So after returning true on "Failed to enable link training", the link training still fails may be because since the TP was not written correctly, the CR or Ch EQ fails and link training fails. Is that what you are seeing?

Manasi
Comment 22 Manasi 2017-06-06 19:03:47 UTC
(In reply to Marta Löfstedt from comment #18)
> Created attachment 131731 [details]
> dmesg for Manasi suggestion Comment 17

This dmesg log shows that we still do a fallback on eDP when link training fails the first time but then it seems to pass at reduced link rate to 162000.

Now why don't we see any dp_aux_ch timeouts at this econd attempt of modeset with reduced link rate?

So the test passes here with just that first warning right? Isnt that whats expected though?

Manasi
Comment 23 Martin Peres 2017-06-06 19:09:22 UTC
(In reply to Manasi from comment #22)
> (In reply to Marta Löfstedt from comment #18)
> > Created attachment 131731 [details]
> > dmesg for Manasi suggestion Comment 17
> 
> This dmesg log shows that we still do a fallback on eDP when link training
> fails the first time but then it seems to pass at reduced link rate to
> 162000.
> 
> Now why don't we see any dp_aux_ch timeouts at this econd attempt of modeset
> with reduced link rate?

The two are unrelated. It just happens that now, the panel has had enough time to start.

> 
> So the test passes here with just that first warning right? Isnt that whats
> expected though?

No, the test fails because the mode cannot be set again, because it has been pruned (making sure the lowest mode never gets pruned is another bug).

So, this bug is about IGT not checking that a mode is available before setting it. This never was a problem because IGT simply sets the same mode that was already-set when starting the test. In other word, the link-status patch broke this assumption and IGT does not know what to do.
Comment 24 Marta Löfstedt 2017-06-07 08:47:23 UTC
Martin, if I can get some time on the device, I will test Chris igt patch:

https://patchwork.freedesktop.org/patch/158352/

The assumption is that with the IGT patch the new reduced link rate should be accepted and we would just get one dmesg-warn and no failing tests.
Comment 25 Marta Löfstedt 2017-06-07 10:44:53 UTC
I tested Chris patch, but the test still fails and we still end up in the no modes available state.
Comment 26 Marta Löfstedt 2017-06-07 12:29:17 UTC
Also, remember that when the failing starts at for example  igt@kms_busy@basic-flip-default-c the we fail at a very different place compared to the kms_flip fail. However, then we see from IGT:
(kms_busy:3941) igt-kms-WARNING: connector 48/eDP-1 has no modes
ssame hold true for the: igt@kms_pipe_crc_basic and igt@kms_cursor_legacy fails that is going on on this machine after we have ended up in this state.
I discussed with Petri if IGT should skip when we end up in this "no modes" state. But I believe Petri doesn't want the IGT skip to be used for bad states that has occurred in test run-time. Such as the skip by kms_flip.
The suggestion would be to implement an IGT abort exit criteria instead of using the IGT skip for different reasons.
Comment 27 Manasi 2017-06-07 16:08:59 UTC
I still think that the real reason here for the link failure is the aux ch timeouts and in that case the link training should not fallback and reduce link rate/lane count. 

Also if we add !is_edp to our fallback algorithm, it should behave exactly like it did before the patch for eDP, so the modes should not get pruned and the test should pass. Did it in this case?

Also in general for IGT to handle the link failures, the kernel does not prune any mode itself after link failure, all it does is reduces link rate/lane count. So if IGT does a new modeset at the same mode again and if that mode is too high for available link bw, it will get pruned. In that case, IGT has to do a full modeset on a lower mode and if no mode is available, yes the IGT will fail/skip the test. 

Manasi
Comment 28 Martin Peres 2017-07-05 07:43:08 UTC
This should fix the issue:
c99a259 drm/i915/edp: Add a T12 panel delay quirk to fix DP AUX CH timeouts
Comment 29 Martin Peres 2017-07-05 07:43:24 UTC
No need to ask QA to verify, so forcing to CLOSED.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.