Summary: | [BAT][SKL] igt@kms_flip@basic-flip-vs-dpms produces a dmesg-fail | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> | ||||||
Component: | DRM/Intel | Assignee: | Manasi <manasi.d.navare> | ||||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Severity: | critical | ||||||||
Priority: | highest | CC: | intel-gfx-bugs, ricardo.vega | ||||||
Version: | DRI git | ||||||||
Hardware: | Other | ||||||||
OS: | All | ||||||||
Whiteboard: | ReadyForDev | ||||||||
i915 platform: | SKL | i915 features: | display/DP | ||||||
Attachments: |
|
Description
Martin Peres
2017-05-24 13:59:10 UTC
Bug about the link training error: https://bugs.freedesktop.org/show_bug.cgi?id=101144 Note this fails on: CI_DRM_2645, CI_DRM_2647, CI_DRM_2648, CI_DRM_2649, CI_DRM_2651, CI_DRM_2652, CI_DRM_2653, CI_DRM_2654 It can fail on any of the pipes I discussed this a bit with Mika Kahola and Martin Peres and we think that we shouldn't try to prune the last mode. Also, we believe that IGT should skip (kms) tests if we can't get any mode working. I can't seem to be able to reproduce the issue if I add a msleep(500) at the beginning of: static void intel_dp_check_link_status(struct intel_dp *intel_dp) This also appear to do the trick: diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c index 4a6feb6..ec6de81 100644 --- a/drivers/gpu/drm/i915/intel_dp.c +++ b/drivers/gpu/drm/i915/intel_dp.c @@ -1106,7 +1106,7 @@ intel_dp_aux_ch(struct intel_dp *intel_dp, aux_clock_divider); /* Must try at least 3 times according to DP spec */ - for (try = 0; try < 5; try++) { + for (try = 0; try < 10; try++) { But I since the issue hasn't been 100% reproducible I believe we need to find a root cause and not just slowing things down. If I revert: "commit 9301397a63b3bf1090dffe846c6f1c8efa032236 Author: Manasi Navare <manasi.d.navare@intel.com> Date: Thu Apr 6 16:44:19 2017 +0300 drm/i915: Implement Link Rate fallback on Link training failure" I can't reproduce the test failing. However I still get the dmesg-warn (see bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 ) (In reply to Marta Löfstedt from comment #6) > If I revert: > > "commit 9301397a63b3bf1090dffe846c6f1c8efa032236 > Author: Manasi Navare <manasi.d.navare@intel.com> > Date: Thu Apr 6 16:44:19 2017 +0300 > > drm/i915: Implement Link Rate fallback on Link training failure" > > I can't reproduce the test failing. However I still get the dmesg-warn (see > bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 ) Before the change, we plunge on with channel equalization regardless of failures during clock recovery phase. By magic, on some machines it'll still all work despite the errors. Per spec, the correct thing to do is to check and handle the failures gracefully, and reducing the link, which Manasi's commit attempts to do. (In reply to Marta Löfstedt from comment #5) > But I since the issue hasn't been 100% reproducible I believe we need to > find a root cause and not just slowing things down. Agreed. In this case it seems to boil down to DP aux channel communication. On a whim, please try: diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c index fc691b8b317c..fb568f75939f 100644 --- a/drivers/gpu/drm/i915/intel_dp.c +++ b/drivers/gpu/drm/i915/intel_dp.c @@ -2431,6 +2431,8 @@ void intel_dp_sink_dpms(struct intel_dp *intel_dp, int mode) { int ret, i; + return; + /* Should have a valid DPCD by this point */ if (intel_dp->dpcd[DP_DPCD_REV] < 0x11) return; (In reply to Jani Nikula from comment #7) ... > On a whim, please try: > > diff --git a/drivers/gpu/drm/i915/intel_dp.c > b/drivers/gpu/drm/i915/intel_dp.c > index fc691b8b317c..fb568f75939f 100644 > --- a/drivers/gpu/drm/i915/intel_dp.c > +++ b/drivers/gpu/drm/i915/intel_dp.c > @@ -2431,6 +2431,8 @@ void intel_dp_sink_dpms(struct intel_dp *intel_dp, int > mode) > { > int ret, i; > > + return; > + > /* Should have a valid DPCD by this point */ > if (intel_dp->dpcd[DP_DPCD_REV] < 0x11) > return; To my surprise above doesn't work. So, it appears as if it is not dpms related. I have reproduced the issue with 4.11.0-rc7+ "commit 0109808145180d0ed3da05361f80839e6c80caeb Author: Colin Ian King <colin.king@canonical.com> Date: Fri May 19 18:56:17 2017 +0100 drm/i915: Check for allocation failure" This commit is even before we start seeing the dmesg-warn (BUG https://bugs.freedesktop.org/show_bug.cgi?id=101144) (In reply to Marta Löfstedt from comment #6) > If I revert: > > "commit 9301397a63b3bf1090dffe846c6f1c8efa032236 > Author: Manasi Navare <manasi.d.navare@intel.com> > Date: Thu Apr 6 16:44:19 2017 +0300 > > drm/i915: Implement Link Rate fallback on Link training failure" > > I can't reproduce the test failing. However I still get the dmesg-warn (see > bug https://bugs.freedesktop.org/show_bug.cgi?id=101144 ) Even if the underlying problem is there before 9301397a63b3 ("drm/i915: Implement Link Rate fallback on Link training failure"), the commit makes the situation worse. It doesn't matter that it does the right thing according to the specs, the user experience matters. It's a regression. Either it needs to be fixed ASAP or reverted. Created attachment 131703 [details] dmesg from when we fail the test To clarify and distiguish from bug: https://bugs.freedesktop.org/show_bug.cgi?id=101144 I added dmesg from when we fail the test. This is also available at: https://intel-gfx-ci.01.org/CI/CI_DRM_2645/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-dpms.html and a lot of the following CI_DRM_NNNN runs have the same issue. I was wondering why we don't sleep when we are timed out on the aux channel. So, I decided to test below: --- a/drivers/gpu/drm/drm_dp_helper.c +++ b/drivers/gpu/drm/drm_dp_helper.c @@ -198,7 +198,7 @@ static int drm_dp_dpcd_access(struct drm_dp_aux *aux, u8 request, * sufficient, bump to 32 which makes Dell 4k monitors happier. */ for (retry = 0; retry < 32; retry++) { - if (ret != 0 && ret != -ETIMEDOUT) { + if (ret != 0 /*&& ret != -ETIMEDOUT*/) { To my surprise this does NOT help. Note: commit e1083ff35157185b01bc0a99cb19b7cbae0fc9fa Author: Lyude <cpaul@redhat.com> Date: Wed Apr 13 10:58:30 2016 -0400 drm/dp_helper: Always wait before retrying native aux transactions There is also a reference to a discussion about a similar issue, but this time with docking station involved for this patch: commit 82922da39190199260a726d7081a8ea4873e5fd6 Author: Lyude <cpaul@redhat.com> Date: Wed Apr 13 10:58:31 2016 -0400 drm/dp_helper: Retry aux transactions on all errors tested without psr: [ 130.895130] [drm:intel_psr_enable] PSR disable by flag but this doesn't make any difference test is still failing and then we end up in a state without any modes. (In reply to Jani Nikula from comment #10) ... > > Even if the underlying problem is there before 9301397a63b3 ("drm/i915: > Implement Link Rate fallback on Link training failure"), the commit makes > the situation worse. It doesn't matter that it does the right thing > according to the specs, the user experience matters. It's a regression. > Either it needs to be fixed ASAP or reverted. Jani, how about a compromise just do the traing if we are not edp? +++ b/drivers/gpu/drm/i915/intel_dp_link_training.c @@ -308,9 +308,17 @@ void intel_dp_stop_link_train(struct intel_dp *intel_dp) DP_TRAINING_PATTERN_DISABLE); } void intel_dp_start_link_train(struct intel_dp *intel_dp) { struct intel_connector *intel_connector = intel_dp->attached_connector; if (!intel_dp_link_training_clock_recovery(intel_dp)) @@ -328,10 +336,12 @@ intel_dp_start_link_train(struct intel_dp *intel_dp) failure_handling: DRM_DEBUG_KMS("Link Training failed at link rate = %d, lane count = %d", intel_dp->link_rate, intel_dp->lane_count); - if (!intel_dp_get_link_train_fallback_values(intel_dp, - intel_dp->link_rate, - intel_dp->lane_count)) - /* Schedule a Hotplug Uevent to userspace to start modeset */ - schedule_work(&intel_connector->modeset_retry_work); + if (!is_edp(intel_dp)) { + if (!intel_dp_get_link_train_fallback_values(intel_dp, + intel_dp->link_rate, + intel_dp->lane_count)) + /* Schedule a Hotplug Uevent to userspace to start modeset */ + schedule_work(&intel_connector->modeset_retry_work); + } Does the test pass by not doing fallback on eDP? Could you actually try this: In intel_dp_link_training_clock_recovery(): Could you change it to following: if(!intel_dp_reset_link_train(intel_dp, DP_TRAINING_PATTERN1 | DP_LINK_SCRAMBLING_DISABLE)) { DRM_ERROR(); return true; } So just try returning a true there and lets see what it does. Manasi (In reply to Manasi from comment #15) > Does the test pass by not doing fallback on eDP? We don't fail the test, but we still get the dmesg-warn, see bug 101144 Created attachment 131731 [details] dmesg for Manasi suggestion Comment 17 (In reply to Manasi from comment #16) > Could you actually try this: > > In intel_dp_link_training_clock_recovery(): > > Could you change it to following: > > if(!intel_dp_reset_link_train(intel_dp, > DP_TRAINING_PATTERN1 | > DP_LINK_SCRAMBLING_DISABLE)) { > DRM_ERROR(); > return true; > } > > > > > So just try returning a true there and lets see what it does. > > Manasi Manasi, with above the test fails and then we are left without any modes as with a lot of other stuff I have tested. As far as I have understood eDP only has one mode. If we prune it we have nothing left, this could be solved as in Comment 14. Then we could continue the investigation as to why this specific device started to behave bad on the aux channel in bug 101114. (In reply to Marta Löfstedt from comment #19) > (In reply to Manasi from comment #16) > > Could you actually try this: > > > > In intel_dp_link_training_clock_recovery(): > > > > Could you change it to following: > > > > if(!intel_dp_reset_link_train(intel_dp, > > DP_TRAINING_PATTERN1 | > > DP_LINK_SCRAMBLING_DISABLE)) { > > DRM_ERROR(); > > return true; > > } > > > > > > > > > > So just try returning a true there and lets see what it does. > > > > Manasi > > Manasi, with above the test fails and then we are left without any modes as > with a lot of other stuff I have tested. As far as I have understood eDP > only has one mode. If we prune it we have nothing left, this could be solved > as in Comment 14. Then we could continue the investigation as to why this > specific device started to behave bad on the aux channel in bug 101114. Sorry bad reference above should have been bug 101144. So after returning true on "Failed to enable link training", the link training still fails may be because since the TP was not written correctly, the CR or Ch EQ fails and link training fails. Is that what you are seeing? Manasi (In reply to Marta Löfstedt from comment #18) > Created attachment 131731 [details] > dmesg for Manasi suggestion Comment 17 This dmesg log shows that we still do a fallback on eDP when link training fails the first time but then it seems to pass at reduced link rate to 162000. Now why don't we see any dp_aux_ch timeouts at this econd attempt of modeset with reduced link rate? So the test passes here with just that first warning right? Isnt that whats expected though? Manasi (In reply to Manasi from comment #22) > (In reply to Marta Löfstedt from comment #18) > > Created attachment 131731 [details] > > dmesg for Manasi suggestion Comment 17 > > This dmesg log shows that we still do a fallback on eDP when link training > fails the first time but then it seems to pass at reduced link rate to > 162000. > > Now why don't we see any dp_aux_ch timeouts at this econd attempt of modeset > with reduced link rate? The two are unrelated. It just happens that now, the panel has had enough time to start. > > So the test passes here with just that first warning right? Isnt that whats > expected though? No, the test fails because the mode cannot be set again, because it has been pruned (making sure the lowest mode never gets pruned is another bug). So, this bug is about IGT not checking that a mode is available before setting it. This never was a problem because IGT simply sets the same mode that was already-set when starting the test. In other word, the link-status patch broke this assumption and IGT does not know what to do. Martin, if I can get some time on the device, I will test Chris igt patch: https://patchwork.freedesktop.org/patch/158352/ The assumption is that with the IGT patch the new reduced link rate should be accepted and we would just get one dmesg-warn and no failing tests. I tested Chris patch, but the test still fails and we still end up in the no modes available state. Also, remember that when the failing starts at for example igt@kms_busy@basic-flip-default-c the we fail at a very different place compared to the kms_flip fail. However, then we see from IGT: (kms_busy:3941) igt-kms-WARNING: connector 48/eDP-1 has no modes ssame hold true for the: igt@kms_pipe_crc_basic and igt@kms_cursor_legacy fails that is going on on this machine after we have ended up in this state. I discussed with Petri if IGT should skip when we end up in this "no modes" state. But I believe Petri doesn't want the IGT skip to be used for bad states that has occurred in test run-time. Such as the skip by kms_flip. The suggestion would be to implement an IGT abort exit criteria instead of using the IGT skip for different reasons. I still think that the real reason here for the link failure is the aux ch timeouts and in that case the link training should not fallback and reduce link rate/lane count. Also if we add !is_edp to our fallback algorithm, it should behave exactly like it did before the patch for eDP, so the modes should not get pruned and the test should pass. Did it in this case? Also in general for IGT to handle the link failures, the kernel does not prune any mode itself after link failure, all it does is reduces link rate/lane count. So if IGT does a new modeset at the same mode again and if that mode is too high for available link bw, it will get pruned. In that case, IGT has to do a full modeset on a lower mode and if no mode is available, yes the IGT will fail/skip the test. Manasi This should fix the issue: c99a259 drm/i915/edp: Add a T12 panel delay quirk to fix DP AUX CH timeouts No need to ask QA to verify, so forcing to CLOSED. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.