Bug 101144 - [BAT][SKL] *ERROR* failed to enable link training
Summary: [BAT][SKL] *ERROR* failed to enable link training
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: high normal
Assignee: Manasi
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-22 09:13 UTC by Martin Peres
Modified: 2018-04-11 09:24 UTC (History)
1 user (show)

See Also:
i915 platform: KBL, SKL
i915 features: display/DP


Attachments
dmesg when only dmesg-warn (1.72 MB, text/x-log)
2017-06-05 06:11 UTC, Marta Löfstedt
no flags Details
Test Patch to see if powering up the panel again helps (1.71 KB, patch)
2017-06-14 02:18 UTC, Manasi
no flags Details | Splinter Review
dmesg for Mansis test patch (334.80 KB, text/plain)
2017-06-14 06:44 UTC, Marta Löfstedt
no flags Details
debug_patch_0614 (775 bytes, patch)
2017-06-15 01:49 UTC, Manasi
no flags Details | Splinter Review
debug_patch_0615 (4.43 KB, patch)
2017-06-16 00:24 UTC, Manasi
no flags Details | Splinter Review
debug_with_power_cycle_0615 (1.52 KB, patch)
2017-06-16 00:34 UTC, Manasi
no flags Details | Splinter Review
dmesg 2 (232.24 KB, text/plain)
2017-06-16 07:10 UTC, Marta Löfstedt
no flags Details
Dmesg 3 (475.18 KB, text/plain)
2017-06-16 07:11 UTC, Marta Löfstedt
no flags Details
Dmesg 4 (6.24 MB, text/plain)
2017-06-16 07:13 UTC, Marta Löfstedt
no flags Details
dmesg 2 debug_with_panel_power_cycle (1.21 MB, text/plain)
2017-06-16 07:50 UTC, Marta Löfstedt
no flags Details
Ville + 2 Manasi + tweak (5.73 MB, text/plain)
2017-06-16 08:42 UTC, Marta Löfstedt
no flags Details
1_Patch_set_1 (4.43 KB, patch)
2017-06-20 23:45 UTC, Manasi
no flags Details | Splinter Review
2_Patch_Set_1 (1.33 KB, patch)
2017-06-20 23:46 UTC, Manasi
no flags Details | Splinter Review
3_Patch_Set_1 (993 bytes, patch)
2017-06-20 23:46 UTC, Manasi
no flags Details | Splinter Review
1_Patch_Set_2 (4.43 KB, patch)
2017-06-20 23:46 UTC, Manasi
no flags Details | Splinter Review
2_Patch_Set_2 (1.64 KB, patch)
2017-06-20 23:47 UTC, Manasi
no flags Details | Splinter Review
3_Patch_Set_3 (994 bytes, patch)
2017-06-20 23:47 UTC, Manasi
no flags Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Peres 2017-05-22 09:13:55 UTC
The test igt@kms_flip@basic-flip-vs-dpms generated a dmesg error (but still passed) when running on fi-skl-6700hq on CI_DRM_2640:

[drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[drm:intel_dp_start_link_train [i915]] Link Training failed at link rate = 270000, lane count = 2

This is possibly related to the following message a little above in the logs (up to the warning reported by piglit):

[...]
[  420.613916] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  420.622421] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  420.630911] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  420.630939] [drm:drm_dp_dpcd_access] Too many retries, giving up. First error: -110
[  420.634122] [drm:intel_dp_sink_dpms [i915]] failed to enable sink power state
[...]
[  421.172326] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.180857] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.180869] [drm:drm_dp_dpcd_access] Too many retries, giving up. First error: -110
[  421.180893] [drm:intel_dp_set_signal_levels [i915]] Using signal levels 00000000
[  421.180911] [drm:intel_dp_set_signal_levels [i915]] Using vswing level 0
[  421.180971] [drm:intel_dp_set_signal_levels [i915]] Using pre-emphasis level 0
[  421.180991] [drm:intel_dp_program_link_training_pattern [i915]] Using DP training pattern TPS1
[...]
[  421.412348] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.420917] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.429499] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.438068] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.446637] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.455206] [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x7d4003ff
[  421.455218] [drm:drm_dp_dpcd_access] Too many retries, giving up. First error: -110
[  421.455240] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[  421.455261] [drm:intel_dp_start_link_train [i915]] Link Training failed at link rate = 270000, lane count = 2
[  421.455353] [drm:intel_dp_modeset_retry_work_fn [i915]] [CONNECTOR:48:eDP-1]
Comment 1 Manasi 2017-05-22 18:41:57 UTC
I think it still passed because now we handle link failures and so it probably executed the link training fallback and passed at the lower link rate/lane count.
Comment 2 Martin Peres 2017-05-22 19:08:09 UTC
(In reply to Manasi from comment #1)
> I think it still passed because now we handle link failures and so it
> probably executed the link training fallback and passed at the lower link
> rate/lane count.

It did, but that changed the result of the piglit test from pass to dmesg-warn, so we likely have to filter these strings out in IGT ;)
Comment 3 Martin Peres 2017-05-23 11:56:50 UTC
The same happened on pre-merge: https://intel-gfx-ci.01.org/CI/Patchwork_4772/fi-skl-6700hq/igt@kms_busy@basic-flip-default-a.html
Comment 4 Daniel Vetter 2017-05-23 12:03:03 UTC
(In reply to Manasi from comment #1)
> I think it still passed because now we handle link failures and so it
> probably executed the link training fallback and passed at the lower link
> rate/lane count.

In theory true, in practice there's 0 igt testcases that exercise this, which is a gap we should fix. Before we've fixed that gap in test coverage I don't think hiding these is a good idea.
Comment 5 Marta Löfstedt 2017-05-24 13:49:51 UTC
CI_DRM_2640, CI_DRM_2641, CI_DRM_2642, CI_DRM_2644, CI_DRM_2646, CI_DRM_2650 - Dmesg warn on: [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training

CI_DRM_2643 - Pass

CI_DRM_2645, CI_DRM_2647, CI_DRM_2648, CI_DRM_2649, CI_DRM_2651 - Fail: (kms_flip:3948) igt-kms-CRITICAL: Failed assertion: drmModeConnectorSetProperty(fd, connector->connector_id, dpms, mode) == 0

It appears as if someone filed both failing test and dmesg-warning to this bug.
Comment 6 Martin Peres 2017-05-24 13:53:03 UTC
(In reply to Marta Löfstedt from comment #5)
> CI_DRM_2640, CI_DRM_2641, CI_DRM_2642, CI_DRM_2644, CI_DRM_2646, CI_DRM_2650
> - Dmesg warn on: [drm:intel_dp_start_link_train [i915]] *ERROR* failed to
> enable link training
> 
> CI_DRM_2643 - Pass
> 
> CI_DRM_2645, CI_DRM_2647, CI_DRM_2648, CI_DRM_2649, CI_DRM_2651 - Fail:
> (kms_flip:3948) igt-kms-CRITICAL: Failed assertion:
> drmModeConnectorSetProperty(fd, connector->connector_id, dpms, mode) == 0
> 
> It appears as if someone filed both failing test and dmesg-warning to this
> bug.

This is what happen when you reduce coverage, you don't see new errors :s

Thanks for reporting this, I'll file a new bug!
Comment 7 Manasi 2017-06-01 23:34:19 UTC
Actually this one is failing because it is not able to do the drm_dp_dpcd_write over AUX is not able to write all the bytes for the DP_TRAINING_PATTERN_SET in intel_dp_set_link_train function. So it never starts link training and returns false. So this is a different case than link training failure handling.

Martin could you please attach a complete dmesg log for this test on SKL 6700hq?

Manasi
Comment 8 Jani Nikula 2017-06-02 07:57:59 UTC
(In reply to Manasi from comment #7)
> Actually this one is failing because it is not able to do the
> drm_dp_dpcd_write over AUX is not able to write all the bytes for the
> DP_TRAINING_PATTERN_SET in intel_dp_set_link_train function. So it never
> starts link training and returns false. So this is a different case than
> link training failure handling.

Well it is *handled* in intel_dp_start_link_train() as if link training failure happened. The question is, would it have worked if we still plunged on with channel equalisation regardless of errors?
Comment 9 Jani Nikula 2017-06-02 07:58:57 UTC
(In reply to Jani Nikula from comment #8)
> Well it is *handled* in intel_dp_start_link_train() as if link training
> failure happened. The question is, would it have worked if we still plunged
> on with channel equalisation regardless of errors?

See bug 101167.
Comment 10 Manasi 2017-06-02 16:02:33 UTC
But if the aux transactions are failing then it is a real failure and we should not try the fallback I think.
So how do we differentiate between this error and the actual link training error.

Manasi
Comment 11 Martin Peres 2017-06-02 16:04:24 UTC
(In reply to Manasi from comment #10)
> But if the aux transactions are failing then it is a real failure and we
> should not try the fallback I think.
> So how do we differentiate between this error and the actual link training
> error.
> 
> Manasi

Could we set a work item that would wait for 250ms and try again the mode-set, and only set the link-training issue at this point?

I mean, if a panel does not answer to AUX requests, it is probably not up yet, right?
Comment 12 Marta Löfstedt 2017-06-05 06:11:25 UTC
Created attachment 131702 [details]
dmesg when only dmesg-warn
Comment 13 Marta Löfstedt 2017-06-05 06:13:32 UTC
(In reply to Manasi from comment #7)
...
> Martin could you please attach a complete dmesg log for this test on SKL
> 6700hq?
> 
> Manasi

I added dmesg from:
https://intel-gfx-ci.01.org/CI/CI_DRM_2640/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-dpms.html

This is when we only have dmesg-warn which this bug is about, when we also fail the test please see bug: https://bugs.freedesktop.org/show_bug.cgi?id=101167
Comment 14 Manasi 2017-06-06 18:52:38 UTC
(In reply to Martin Peres from comment #11)
> (In reply to Manasi from comment #10)
> > But if the aux transactions are failing then it is a real failure and we
> > should not try the fallback I think.
> > So how do we differentiate between this error and the actual link training
> > error.
> > 
> > Manasi
> 
> Could we set a work item that would wait for 250ms and try again the
> mode-set, and only set the link-training issue at this point?
> 
> I mean, if a panel does not answer to AUX requests, it is probably not up
> yet, right?




Hi Martin,

But that should happen even before starting the link training, if the AUX is not up yet, then we should not go ahead and do the clock recovery phase at all, why not wait here for 250ms and then find some way to just fail without attempting link training.

I am trying to find out at what point can we first check for dp aux transactions and wait longer to try again?

Manasi
Comment 15 Martin Peres 2017-06-06 19:12:33 UTC
(In reply to Manasi from comment #14)
> (In reply to Martin Peres from comment #11)
> > (In reply to Manasi from comment #10)
> > > But if the aux transactions are failing then it is a real failure and we
> > > should not try the fallback I think.
> > > So how do we differentiate between this error and the actual link training
> > > error.
> > > 
> > > Manasi
> > 
> > Could we set a work item that would wait for 250ms and try again the
> > mode-set, and only set the link-training issue at this point?
> > 
> > I mean, if a panel does not answer to AUX requests, it is probably not up
> > yet, right?
> 
> Hi Martin,
> 
> But that should happen even before starting the link training, if the AUX is
> not up yet, then we should not go ahead and do the clock recovery phase at
> all, why not wait here for 250ms and then find some way to just fail without
> attempting link training.

I fully agree with the beginning. We should not go to the clock recovery phase if we did not even go through link training at all.

However, not sure we can just wait there for 250ms before re-trying. This is a question for a kernel developer, as I never even read this code.

> 
> I am trying to find out at what point can we first check for dp aux
> transactions and wait longer to try again?

Yes, that sounds like the right thing to do! Maybe Jani can chime in here :)
Comment 16 Ville Syrjala 2017-06-07 13:20:06 UTC
One peculiar thing see in some of the logs at least is a double long HPD when we turn on the panel. I wonder if there's some kind of transient glitch on the VDD line or something that causes to panel to do some kind of off->on->off->on transition.

To get some more debug information maybe someone could hack up a test patch that checks the state of the HPD pin in a few places during the modeset; Eg. just after the panel supposedly powered on and just before starting link training, and maybe in a few places in between.

What kind of machine is this? I'm wondering of the panel power delays might be off somehow, eg. the power cycle delay might be too short and the panel doesn't like it if we power it back up too soon...
Comment 17 Jani Nikula 2017-06-07 16:07:46 UTC
(In reply to Manasi from comment #14)
> I am trying to find out at what point can we first check for dp aux
> transactions and wait longer to try again?

Can you explain how we'd end up in link training if native aux was failing altogether? It must have worked for some time before link training, right?

Ville's explanation would cover this as well.
Comment 18 Marta Löfstedt 2017-06-08 06:48:27 UTC
I found another skl lap-top to test on. The issue can not be reproduced. So, at least we now have some facts that something is different with this specific machine.
Comment 19 Marta Löfstedt 2017-06-08 07:00:11 UTC
(In reply to Ville Syrjala from comment #16)
> What kind of machine is this? I'm wondering of the panel power delays might
> be off somehow, eg. the power cycle delay might be too short and the panel
> doesn't like it if we power it back up too soon...

Ville, information about the lab machines is here:
https://intel-gfx-ci.01.org/CI/hardware.html

fi-skl-6700hq	Toshiba Satellite P50-C-18C	Skylake i7-6700hq	HD Graphics 530 / GT2	eDP (HDMI)
Comment 20 Marta Löfstedt 2017-06-08 10:00:43 UTC
Since, we don't have batteries in the lap-tops in the lab. I decided to test the far fetched idea that maybe the powering-up timing on this device could somehow be affected by battery being present or not.
But this turned out to give no difference in the results.
Comment 21 Marta Löfstedt 2017-06-09 06:34:04 UTC
(In reply to Ville Syrjala from comment #16)
> One peculiar thing see in some of the logs at least is a double long HPD
> when we turn on the panel. I wonder if there's some kind of transient glitch
> on the VDD line or something that causes to panel to do some kind of
> off->on->off->on transition.
> 
> To get some more debug information maybe someone could hack up a test patch
> that checks the state of the HPD pin in a few places during the modeset; Eg.
> just after the panel supposedly powered on and just before starting link
> training, and maybe in a few places in between.
> 
> What kind of machine is this? I'm wondering of the panel power delays might
> be off somehow, eg. the power cycle delay might be too short and the panel
> doesn't like it if we power it back up too soon...

Ville, I have looked through a lot of the dmesg and this double long HPD thing appear to always happen before we get the aux timeout and after that the link training dmesg-warn (and after that the bug 101167 issue), actually I have even seen 3 long HPDs in a short time.
Comment 22 Manasi 2017-06-12 21:27:38 UTC
(In reply to Jani Nikula from comment #17)
> (In reply to Manasi from comment #14)
> > I am trying to find out at what point can we first check for dp aux
> > transactions and wait longer to try again?
> 
> Can you explain how we'd end up in link training if native aux was failing
> altogether? It must have worked for some time before link training, right?
> 
> Ville's explanation would cover this as well.

Actually it fails inside the function intel_dp_link_training_clock_reciovery() when it tries to write the TP1 and errors out saying "Failed to enable link training".
Yes but it worked fine for DPCD reads over AUX, so that's why I am confused as to where we need to add the delay for the AUX reads?

Also, shouldn't the spec for that specific panel or the eDP spec talk about how much delay should be given for the AUX read/writes?
Any thoughts on double checking this in the spec?

Manasi
Comment 23 Marta Löfstedt 2017-06-13 10:12:06 UTC
(In reply to Manasi from comment #22)
> (In reply to Jani Nikula from comment #17)
> > (In reply to Manasi from comment #14)
> > > I am trying to find out at what point can we first check for dp aux
> > > transactions and wait longer to try again?
> > 
> > Can you explain how we'd end up in link training if native aux was failing
> > altogether? It must have worked for some time before link training, right?
> > 
> > Ville's explanation would cover this as well.
> 
> Actually it fails inside the function
> intel_dp_link_training_clock_reciovery() when it tries to write the TP1 and
> errors out saying "Failed to enable link training".
> Yes but it worked fine for DPCD reads over AUX, so that's why I am confused
> as to where we need to add the delay for the AUX reads?
> 
> Also, shouldn't the spec for that specific panel or the eDP spec talk about
> how much delay should be given for the AUX read/writes?
> Any thoughts on double checking this in the spec?
> 
> Manasi

Manasi, I believe that one theory is that device have actually become broken/unstable and that we should try to prove this somehow.

If you or Ville has some ideas on how to prove/disprove this, could you try this on the try-bot, or provide a patch that I can run on the machine early in the morning when the CI is usually idle.
Comment 24 Manasi 2017-06-13 16:08:26 UTC
That's is why I was trying to see if we could find an exact same machine with the same eDP panel here and if we cant reproduce then we know that may be that device is unstable.

Do you have the remote login details for that system? Else I will work on a patch to give to you for testing.

Manasi
Comment 25 Manasi 2017-06-14 02:18:54 UTC
Created attachment 131941 [details] [review]
Test Patch to see if powering up the panel again helps
Comment 26 Manasi 2017-06-14 02:21:57 UTC
Hi Marta,

After looking through the dmesg logs, looks like in the modeset in intel_ddi_pre_enable_dp(), after the edp panel is powered on, it gets two /three HPD pulses after which it shows AUX ch timeouts. That could be because the panel is not powered on or eDP connector is disconnected. 

The attached patch adds debug print for checking the connector status here and also powers up panel again just before link training.

Could you please test it out and see what happens and attach the debug log?

Manasi



(In reply to Marta Löfstedt from comment #23)
> (In reply to Manasi from comment #22)
> > (In reply to Jani Nikula from comment #17)
> > > (In reply to Manasi from comment #14)
> > > > I am trying to find out at what point can we first check for dp aux
> > > > transactions and wait longer to try again?
> > > 
> > > Can you explain how we'd end up in link training if native aux was failing
> > > altogether? It must have worked for some time before link training, right?
> > > 
> > > Ville's explanation would cover this as well.
> > 
> > Actually it fails inside the function
> > intel_dp_link_training_clock_reciovery() when it tries to write the TP1 and
> > errors out saying "Failed to enable link training".
> > Yes but it worked fine for DPCD reads over AUX, so that's why I am confused
> > as to where we need to add the delay for the AUX reads?
> > 
> > Also, shouldn't the spec for that specific panel or the eDP spec talk about
> > how much delay should be given for the AUX read/writes?
> > Any thoughts on double checking this in the spec?
> > 
> > Manasi
> 
> Manasi, I believe that one theory is that device have actually become
> broken/unstable and that we should try to prove this somehow.
> 
> If you or Ville has some ideas on how to prove/disprove this, could you try
> this on the try-bot, or provide a patch that I can run on the machine early
> in the morning when the CI is usually idle.
Comment 27 Marta Löfstedt 2017-06-14 06:44:53 UTC
Created attachment 131945 [details]
dmesg for Mansis test patch

Manasi, here is dmesg with you testpatch.

igt test started:
[   89.255054] [IGT] kms_flip: executing

subtest started:
[   89.273659] [IGT] kms_flip: starting subtest basic-flip-vs-dpms

first everything looks fine:
"[   89.575307] [drm:intel_hpd_irq_handler] Received HPD interrupt on PIN 4 - cnt: 0
...
[   90.345461] [drm:intel_ddi_pre_enable] Manasi Debug [CONNECTOR:48:eDP-1] status connected 
[   90.345763] [drm:edp_panel_vdd_on] PP_STATUS: 0x80000008 PP_CONTROL: 0x0000000b
[   90.346891] [drm:intel_dp_set_signal_levels] Using signal levels 00000000"

This HPD also goes fine:
[   91.624826] [drm:intel_hpd_irq_handler] digital hpd port A - long
[   91.689787] [drm:intel_ddi_pre_enable] Manasi Debug [CONNECTOR:48:eDP-1] status connected 

Then we run into the usual issue, i.e. double HPDs:
"[   92.925623] [drm:intel_hpd_irq_handler] Received HPD interrupt on PIN 4 - cnt: 1
...
[   92.961714] [drm:intel_hpd_irq_handler] Received HPD interrupt on PIN 4 - cnt: 2
...
[   92.990129] [drm:intel_ddi_pre_enable] Manasi Debug [CONNECTOR:48:eDP-1]" 

Then the usual aux timeouts:
"[   92.990367] [drm:edp_panel_vdd_on] Turning eDP port A VDD on
[   92.990414] [drm:edp_panel_vdd_on] PP_STATUS: 0x80000008 PP_CONTROL: 0x0000000b
[   92.998844] [drm:intel_dp_aux_ch] dp_aux_ch timeout status 0x7c1003ff"

and then:
[   94.616008] [drm:intel_dp_start_link_train] *ERROR* failed to enable link training
Comment 28 Manasi 2017-06-15 01:00:24 UTC
Hi Marta,

Yes that didn't help since the driver thinks that the eDP panel is already ON, but may be its turning on and off many times generating multiple HPD pulses. So just disregard the patch.

One more thing I want you to try is in the igt test itself (kms_flip.c), in the run_test_on_crtc_set() function, I see that kmstest_unset_all_crtcs() is called which actually powers off the panel. Could you add a usleep() call here to sleep for a few ms before actually doing a modeset where it will power the eDP panel back on.
This might give some time to the panel power sequencer to stabilize before turning it back on.

This is one of the tests, I have few more things to debug in mind and I will send details regarding the second test soon.
Comment 29 Manasi 2017-06-15 01:48:13 UTC
Hi Marta,

This is the second test I want you to do with the attached patch

Manasi
Comment 30 Manasi 2017-06-15 01:49:13 UTC
Created attachment 131965 [details] [review]
debug_patch_0614
Comment 31 Manasi 2017-06-15 01:52:29 UTC
(In reply to Ville Syrjala from comment #16)
> One peculiar thing see in some of the logs at least is a double long HPD
> when we turn on the panel. I wonder if there's some kind of transient glitch
> on the VDD line or something that causes to panel to do some kind of
> off->on->off->on transition.
> 
> To get some more debug information maybe someone could hack up a test patch
> that checks the state of the HPD pin in a few places during the modeset; Eg.
> just after the panel supposedly powered on and just before starting link
> training, and maybe in a few places in between.
> 
> What kind of machine is this? I'm wondering of the panel power delays might
> be off somehow, eg. the power cycle delay might be too short and the panel
> doesn't like it if we power it back up too soon...


Hi Ville,

I want to add the check for HPD live status after the intel_edp_panel_on() call in intel_ddi_pre_enable_dp() and right before intel_dp_start_link_train().
I am looking at the register PORT_HOTPLUG_STAT in the bspec that gives the live status. But it does not have a bit for Port A, how do I check the live status for Port A?
Comment 32 Marta Löfstedt 2017-06-15 06:16:25 UTC
(In reply to Manasi from comment #28)
> Hi Marta,
> 
> Yes that didn't help since the driver thinks that the eDP panel is already
> ON, but may be its turning on and off many times generating multiple HPD
> pulses. So just disregard the patch.
> 
> One more thing I want you to try is in the igt test itself (kms_flip.c), in
> the run_test_on_crtc_set() function, I see that kmstest_unset_all_crtcs() is
> called which actually powers off the panel. Could you add a usleep() call
> here to sleep for a few ms before actually doing a modeset where it will
> power the eDP panel back on.
> This might give some time to the panel power sequencer to stabilize before
> turning it back on.
> 
> This is one of the tests, I have few more things to debug in mind and I will
> send details regarding the second test soon.

I tested sleeping for 50ms, but is doesn't help.
Also, the issue can also start at: igt@kms_busy@basic-flip-default-*.
Comment 33 Marta Löfstedt 2017-06-15 06:35:17 UTC
(In reply to Manasi from comment #29)
> Hi Marta,
> 
> This is the second test I want you to do with the attached patch
> 
> Manasi


With the patch that force the eDP to not be turned off, the test pass.

I believe the try-bot would be faster than the Marta-bot for this type of testing ;)

Marta
Comment 34 Manasi 2017-06-16 00:23:36 UTC
Hi Marta,

Could you apply Ville's patch : https://patchwork.freedesktop.org/patch/161828/
and then my debug patch attached?

Then I want you to give me two dmesg logs:

1. Boot the system and capture the dmesg log (Hopefully you get some display on edp panel)
2. Boot the system and run only 1 subtest from kms_flip.c that is giving this error , so just run the flip-vs-dpms test - collect dmesg
3. Boot the system and run some other subtest from kms_flip.c that is not having this error/warning - collect dmesg
4. Boot the system and do a full kms_flip.c run and collect dmesg.

Thanks you so much for helping out. I will try to stay up later in the night so I can be online when you are testing.

Manasi
Comment 35 Manasi 2017-06-16 00:24:06 UTC
Created attachment 131988 [details] [review]
debug_patch_0615
Comment 36 Manasi 2017-06-16 00:33:29 UTC
After finishing the previous tests, could you please do the same tests with Ville's Patch, my debug patch and the third patch (debug_with_panel_power_cycle) attached above?

Thanks a lot for your time.

Manasi
Comment 37 Manasi 2017-06-16 00:34:28 UTC
Created attachment 131989 [details] [review]
debug_with_power_cycle_0615
Comment 38 Marta Löfstedt 2017-06-16 07:10:42 UTC
Created attachment 131996 [details]
dmesg 2

Dmesg for Manasis requests:

1. Boot the system and capture the dmesg log (Hopefully you get some display on edp panel)
2. Boot the system and run only 1 subtest from kms_flip.c that is giving this error , so just run the flip-vs-dpms test - collect dmesg
Comment 39 Marta Löfstedt 2017-06-16 07:11:50 UTC
Created attachment 131997 [details]
Dmesg 3

Dmesg for Manasis request:

3. Boot the system and run some other subtest from kms_flip.c that is not having this error/warning - collect dmesg
Comment 40 Marta Löfstedt 2017-06-16 07:13:06 UTC
Created attachment 131998 [details]
Dmesg 4

Dmesg for Manasis request:

4. Boot the system and do a full kms_flip.c run and collect dmesg.
Comment 41 Marta Löfstedt 2017-06-16 07:50:01 UTC
Created attachment 131999 [details]
dmesg 2 debug_with_panel_power_cycle

dmesg for Manasis request:
After finishing the previous tests, could you please do the same tests with Ville's Patch, my debug patch and the third patch (debug_with_panel_power_cycle) attached above?
Comment 42 Marta Löfstedt 2017-06-16 08:34:15 UTC
Manasi, I added a tweak to you last patch:

diff --git a/drivers/gpu/drm/i915/intel_ddi.c b/drivers/gpu/drm/i915/intel_ddi.c
index 2750633..dff8830 100644
--- a/drivers/gpu/drm/i915/intel_ddi.c
+++ b/drivers/gpu/drm/i915/intel_ddi.c
@@ -1669,6 +1669,7 @@ static void intel_ddi_pre_enable_dp(struct intel_encoder *encoder,
                DRM_DEBUG_KMS("\nManasi Debug: HPD Live status after turning Panel On = %d", test);
                if (!test) {
                        DRM_DEBUG_KMS("\nManasi: HPD not active still so power off and power on");
+                       intel_dp->want_panel_vdd = true;
                        intel_edp_panel_off(intel_dp);
                        test = intel_digital_port_connected(dev_priv, dig_port);
                        DRM_DEBUG_KMS("\nManasi Debug: HPD Live status after turning Panel On = %d", test);


Now the test pass.
Comment 43 Marta Löfstedt 2017-06-16 08:42:19 UTC
Created attachment 132000 [details]
Ville + 2 Manasi + tweak

This is dmesg where the test passed a couple of times.
Comment 44 Marta Löfstedt 2017-06-16 08:54:24 UTC
NOTE: I have now managed to hang the machine hard with the patches: 
Ville + 2 x Manasi + tweak. 

So, I assume the turn off then on again needs to be done properly.

I have also managed to fail the test with above patches.
Comment 45 Manasi 2017-06-16 16:26:44 UTC
Thanks for tests, I will analyze all the collected logs now.
Comment 46 Manasi 2017-06-16 16:27:27 UTC
You said the test passes with these, does it hang as well?




(In reply to Marta Löfstedt from comment #44)
> NOTE: I have now managed to hang the machine hard with the patches: 
> Ville + 2 x Manasi + tweak. 
> 
> So, I assume the turn off then on again needs to be done properly.
> 
> I have also managed to fail the test with above patches.
Comment 47 Manasi 2017-06-16 21:47:02 UTC
(In reply to Marta Löfstedt from comment #38)
> Created attachment 131996 [details]
> dmesg 2
> 
> Dmesg for Manasis requests:
> 
> 1. Boot the system and capture the dmesg log (Hopefully you get some display
> on edp panel)
> 2. Boot the system and run only 1 subtest from kms_flip.c that is giving
> this error , so just run the flip-vs-dpms test - collect dmesg

Thanks Marta.
Does this include the entire log when the system is booting up?
If it is then it is a problem because I do not see the DMESG statements in intel_dp_init_panel_power_sequencer() which means it might not have initialized the delays for power on and off correctly.

Can you make sure this includes the dmesg log at the system boot time when it will initialize the edp panel with the VBT values.

Manasi
Comment 48 Manasi 2017-06-16 22:43:32 UTC
(In reply to Marta Löfstedt from comment #43)
> Created attachment 132000 [details]
> Ville + 2 Manasi + tweak
> 
> This is dmesg where the test passed a couple of times.

I donno how the test passed here since the dmesg log still shoes no difference and still see the AUX CH timeouts. Is this the correct log?

Manai
Comment 49 Manasi 2017-06-20 00:28:29 UTC
I have sent several patch sets to trybot for trying out several experiments with panel power delays. Lets see how the results look.

Manasi
Comment 50 Manasi 2017-06-20 23:44:06 UTC
I tried a few things on Trybot, but I am guessing the system probably needs a hard reboot or something for the edp panel to be back on

Marta,

Could you first test the attached patches set 1?

And then also test patches Set 2?

Manasi
Comment 51 Manasi 2017-06-20 23:45:28 UTC
Created attachment 132103 [details] [review]
1_Patch_set_1
Comment 52 Manasi 2017-06-20 23:46:06 UTC
Created attachment 132104 [details] [review]
2_Patch_Set_1
Comment 53 Manasi 2017-06-20 23:46:28 UTC
Created attachment 132105 [details] [review]
3_Patch_Set_1
Comment 54 Manasi 2017-06-20 23:46:53 UTC
Created attachment 132106 [details] [review]
1_Patch_Set_2
Comment 55 Manasi 2017-06-20 23:47:20 UTC
Created attachment 132107 [details] [review]
2_Patch_Set_2
Comment 56 Manasi 2017-06-20 23:47:39 UTC
Created attachment 132108 [details] [review]
3_Patch_Set_3
Comment 57 Manasi 2017-06-20 23:48:52 UTC
Comment on attachment 132108 [details] [review]
3_Patch_Set_3

This is the 3 rd patch from Patch set 2
Comment 58 Manasi 2017-06-20 23:50:37 UTC
Actually I figured the Patchwork link would probably be the best:

Test 1:
https://patchwork.freedesktop.org/series/26090/

Test2:
https://patchwork.freedesktop.org/series/26091/
Comment 59 Manasi 2017-06-21 20:52:14 UTC
Here's an update on this bug:

Found that this specific panel requires a higher panel power cycle delay no less than 800msecs which is 300ms greater than the eDP min value. With this value it is able to power cycle the edp panel correctly and can turn it back on correctly without any aux ch timeouts. 

I tried this fix on Trybot multiple times and consistently got all Success on 6700hq system:
https://patchwork.freedesktop.org/series/26172/

I am working on a patch that would add it in the quirk database for this panel.

JaniN, Ville let me know what you think about this quirk?

Manasi
Comment 60 Marta Löfstedt 2017-06-22 05:58:47 UTC
(In reply to Manasi from comment #59)
> Here's an update on this bug:
> 
> Found that this specific panel requires a higher panel power cycle delay no
> less than 800msecs which is 300ms greater than the eDP min value. With this
> value it is able to power cycle the edp panel correctly and can turn it back
> on correctly without any aux ch timeouts. 
> 
> I tried this fix on Trybot multiple times and consistently got all Success
> on 6700hq system:
> https://patchwork.freedesktop.org/series/26172/
> 
> I am working on a patch that would add it in the quirk database for this
> panel.
> 
> JaniN, Ville let me know what you think about this quirk?
> 
> Manasi

A quirk for this panels sound good to me. We have run the test on another eDP SKL laptop and this issue couldn't be reproduced.
Comment 61 Manasi 2017-06-22 23:34:34 UTC
I am working on adding this to the EDID quirk database.
Marta, could you please give me EDID hexdump on that system?

cd /sys/class/drm/card0-edp/
hexdump -C edid

I will need EDID to find the correct Product ID and Vendor ID to add to the quirk database.

Thanks
Manasi
Comment 62 Marta Löfstedt 2017-06-26 07:23:53 UTC
00000000  00 ff ff ff ff ff ff 00  30 e4 70 04 00 00 00 00  |........0.p.....|
00000010  00 18 01 04 95 23 13 78  ea dc 95 a3 58 55 a0 26  |.....#.x....XU.&|
00000020  0d 50 54 00 00 00 01 01  01 01 01 01 01 01 01 01  |.PT.............|
00000030  01 01 01 01 01 01 2e 36  80 a0 70 38 1f 40 30 20  |.......6..p8.@0 |
00000040  35 00 59 c2 10 00 00 1a  00 00 00 00 00 00 00 00  |5.Y.............|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 fe 00 4c  |...............L|
00000060  47 20 44 69 73 70 6c 61  79 0a 20 20 00 00 00 fe  |G Display.  ....|
00000070  00 4c 50 31 35 36 57 46  36 2d 53 50 41 31 00 7b  |.LP156WF6-SPA1.{|
00000080
Comment 63 Ville Syrjala 2017-06-27 10:37:13 UTC
Couple of observations I made:

[    5.001356] [drm:intel_pps_dump_state [i915]] cur t1_t3 0 t8 0 t9 0 t10 500 t11_t12 6000
[    5.001412] [drm:intel_pps_dump_state [i915]] vbt t1_t3 2000 t8 2000 t9 2000 t10 500 t11_t12 5000

So the BIOS doesn't program the t1+t3 delay at all. Not sure why. The VBT seems
to have these pps delays for most of the panel indexes:
 Power Sequence: T3 2000 T7 10 T9 2000 T10 500 T12 5000
except panel 7 (which is what we end up using:
 Power Sequence: T3 2000 T7 2000 T9 2000 T10 500 T12 5000
So only T7 is changed, which shouldn't matter anyway.

The panel spec would appear to list the following requirements:
T1 0.5-10 ms
T2/T3 0-200 ms
T11 <= 10ms
T12 >= 120 ms

So it looks like the values we use should be fine in theory.
Comment 64 Manasi 2017-06-28 20:44:07 UTC
Quirk patches have been submitted to the intel-gfx M-L and it is passing all the CI tests now. Needs to be reviewed by Ville
Comment 65 Martin Peres 2017-07-05 07:40:21 UTC
This fixed the issue:
c99a259 drm/i915/edp: Add a T12 panel delay quirk to fix DP AUX CH timeouts
Comment 66 Martin Peres 2017-07-05 07:41:01 UTC
No need to ask QA to verify, so forcing to CLOSED.
Comment 67 Martin Peres 2017-07-05 10:10:27 UTC
Oh well, this is still visible, even without the AUX timeouts:
https://intel-gfx-ci.01.org/CI/CI_DRM_2801/fi-skl-6700hq/igt@kms_busy@basic-flip-default-b.html
Comment 68 Manasi 2017-07-05 22:51:01 UTC
There is AUX timeouts in this as well, but the delay is set to 800 ms as per the quirk. So for now I have started to believe that this system is probably really unstable and is giving AUX timeouts without any certain pattern/behavior. We should remove this system from CI
Comment 69 Martin Peres 2017-07-06 12:06:40 UTC
(In reply to Manasi from comment #68)
> There is AUX timeouts in this as well, but the delay is set to 800 ms as per
> the quirk. So for now I have started to believe that this system is probably
> really unstable and is giving AUX timeouts without any certain
> pattern/behavior. We should remove this system from CI

Isn't that a good test for IGT then? Right now, IGT assumes that it can always apply a resolution, and that is not true for DP.

We may end up getting rid of the system, but what I see right now is software that is not ready for the new requirements forced upon us by the display port.
Comment 70 Manasi 2017-07-17 18:20:44 UTC
But the error is seen because of AUX timeouts and this is happening because the systems looks unstable, because despite the edp panel power cycle delays as per the spec and increasing them as per the quirk, we still see the AUX timeouts. And since this is very specific to this system and this test, this is more of an issue with the system and not the SW.

Manasi
Comment 71 Manasi 2017-07-24 19:19:19 UTC
Martin, is this still seen on CI? I tried increasing the delay further and I see it only runs 3 IGT tests and those are all passing on SKL 6700hq system.

Manasi
Comment 72 Martin Peres 2017-07-24 19:28:03 UTC
(In reply to Manasi from comment #71)
> Martin, is this still seen on CI? I tried increasing the delay further and I
> see it only runs 3 IGT tests and those are all passing on SKL 6700hq system.
> 
> Manasi

Indeed, they have all been passing since the run after I re-opened the bug. Maybe there was more than one run in the queue and I got confused? Well, let's hope so, but I am sorry for wasting your time by re-opening the bug too quickly :s

I'll close the bug!
Comment 74 Jani Saarinen 2017-08-12 09:04:36 UTC
add one more: igt@kms_busy@basic-flip-a
Comment 75 Manasi 2017-08-14 18:45:59 UTC
From the logs it looks like the quirk is getting applied but still there are aux channel timeouts. 
Ia m going to try increasing the power cycle delay even more and see what happens.
This system is connected to Trybot right so I can do some tests by pushing patches to trybot?

Manasi
Comment 77 Jani Saarinen 2017-09-05 11:53:26 UTC
But now system has been quite stable. Any reason why?
Comment 78 Jani Saarinen 2017-09-05 14:32:23 UTC
Lets resolve now and wait to pop up again.
Comment 79 Jani Saarinen 2017-09-07 18:11:37 UTC
Again unfortunately.
Comment 80 Manasi 2017-09-07 18:13:38 UTC
Hmm, why is this system so flaky...
Do you have the latest logs where you saw this again? I will start working on this bug as the topmost priority.

Manasi
Comment 82 Manasi 2017-09-07 18:39:27 UTC
Is it possible to remove the system from CI and give me a remote access to this so I can run this specific IGT test on that since the error could be in the way the test is power cycling the panel.

Manasi
Comment 83 Jani Saarinen 2017-10-05 07:00:10 UTC
Series merged
https://patchwork.freedesktop.org/series/31361/
Comment 84 Marta Löfstedt 2017-12-12 07:02:50 UTC
Unfortunately this issue appear to be back on IGT_4057:

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_busy@basic-flip-b.html
dmesg-warn:
[  302.302951] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training

then this is folloed by dmesg-fail or warn on on the following tests:
(kms_busy:3998) igt-kms-WARNING: connector 59/eDP-1 has no modes
[  310.115358] [drm:intel_dp_check_link_status [i915]] *ERROR* Failed to get link status
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_busy@basic-flip-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-after-cursor-atomic.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-after-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-before-cursor-atomic.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_cursor_legacy@basic-flip-before-cursor-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@bad-nb-words-1.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@bad-nb-words-3.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@bad-pipe.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@bad-source.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@hang-read-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@hang-read-crc-pipe-b.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@hang-read-crc-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-a-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-b.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-b-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-a-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-b.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-c-frame-sequence.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_psr_sink_crc@psr_basic.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_setmode@basic-clone-single-crtc.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@kms_sink_crc_basic.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4057/fi-skl-6700hq/igt@pm_backlight@basic-brightness.html
Comment 85 Marta Löfstedt 2017-12-12 10:18:59 UTC
The issue was not reproduced on IGT_4058, so I don't believe this is a simple regression.
Comment 86 Manasi 2017-12-19 00:13:23 UTC
I really feel that this system SKL 6700hq is unstable since this happens because the panel doesnt turn on inspite of the quirk for increasing the panel power cycle delay. With that delay it works for few days and then needs increased delay.
Comment 87 Manasi 2017-12-28 20:35:00 UTC
Is this happening on a KBL system? Could you attach the logs please?

Manasi
Comment 88 Marta Löfstedt 2018-01-08 07:51:17 UTC
I will close this again since the issue has not been seen since that bad run.
Comment 89 Marta Löfstedt 2018-01-15 06:54:21 UTC
Unfortunately I have to reopen this again:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3628/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-dpms.html
Comment 90 Marta Löfstedt 2018-01-18 13:42:34 UTC
Here the issue happened again:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4146/issues.html
Comment 91 Marta Löfstedt 2018-03-08 08:14:54 UTC
I will close this again since it hasn't happened for a long time.
Comment 92 Marta Löfstedt 2018-03-16 07:31:37 UTC
Reopened since the issue was hit while running the shards testlist on BAT machines:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_1/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-modeset.html

[  594.867729] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[  594.867808] [drm:intel_dp_start_link_train [i915]] *ERROR* [CONNECTOR:71:eDP-1] Link Training failed at link rate = 270000, lane count = 2
Comment 93 Marta Löfstedt 2018-04-11 09:24:06 UTC
I am going to archive this old bug from cibuglog and create a new one.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.