86201 – [VLV Bisected] Delayed boot and distorted text console after commit 773538e8

Bug 86201 - [VLV Bisected] Delayed boot and distorted text console after commit 773538e8

Summary: [VLV Bisected] Delayed boot and distorted text console after commit 773538e8

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	highest normal
Assignee:	Ville Syrjala
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-11-12 14:22 UTC by Egbert Eich
Modified:	2017-07-24 22:50 UTC (History)
CC List:	2 users (show)

See Also:
i915 platform:
i915 features:

Attachments
dmesg log showing delay (78.91 KB, text/plain) 2014-11-12 14:22 UTC, Egbert Eich	no flags	Details
dmesg log of kernel prior to offending patch (76.50 KB, text/plain) 2014-11-12 14:23 UTC, Egbert Eich	no flags	Details
[PATCH] drm/i915: Try to avoid pps_{lock,unlock}() on DP ports (2.80 KB, patch) 2014-11-18 13:08 UTC, Ville Syrjala	no flags	Details \| Splinter Review
Fix: When enabling panel VDD cancel pending disable worker (1007 bytes, patch) 2014-11-24 16:27 UTC, Egbert Eich	no flags	Details \| Splinter Review
[PATCH] drm/i915: Kill check_power_well() calls (2.92 KB, patch) 2014-12-17 15:32 UTC, Ville Syrjala	no flags	Details \| Splinter Review
View All

Description Egbert Eich 2014-11-12 14:22:40 UTC

Created attachment 109341 [details]
dmesg log showing delay

Boot is delayed and the text console is flickering & distorted after commit 

commit 773538e86081d146e0020435d614f4b96996c1f9
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Sep 4 14:54:56 2014 +0300

    drm/i915: Reset power sequencer pipe tracking when disp2d is off

has been introduced.

As can be seen from the kernel logs there are delays of several seconds where power to eDP is turned on and off.
Once the system is up the fbdev text console is flickering.
The display is an eDP connected panel, surprisingly an externally connected (DVI) display flickers as well.

Comment 1 Egbert Eich 2014-11-12 14:23:45 UTC

Created attachment 109343 [details]
dmesg log of kernel prior to offending patch

Comment 2 Egbert Eich 2014-11-12 14:26:51 UTC

PCI ID of GPU: 8086:0f31

Comment 3 Ville Syrjala 2014-11-12 14:48:42 UTC

You have this?

commit 1e74a324465e5a4f8e3ee7c423aef92c54de8a72
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Tue Oct 28 16:15:51 2014 +0200

    drm/i915: Initialize PPS timestamps on vlv/chv

Comment 4 Egbert Eich 2014-11-12 18:59:12 UTC

(In reply to Ville Syrjala from comment #3)
> You have this?
> 
> commit 1e74a324465e5a4f8e3ee7c423aef92c54de8a72
> Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Date:   Tue Oct 28 16:15:51 2014 +0200
> 
>     drm/i915: Initialize PPS timestamps on vlv/chv

Yup. 

HEAD of my build is at 

commit ea361c1700ccb5acdc0f5970fd4d64c94c6aa6ee
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Nov 12 10:28:50 2014 +0100

    drm-intel-nightly: 2014y-11m-12d-09h-28m-33s UTC integration manifest

Comment 5 Ville Syrjala 2014-11-18 13:06:00 UTC

(In reply to Egbert Eich from comment #4)
> (In reply to Ville Syrjala from comment #3)
> > You have this?
> > 
> > commit 1e74a324465e5a4f8e3ee7c423aef92c54de8a72
> > Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > Date:   Tue Oct 28 16:15:51 2014 +0200
> > 
> >     drm/i915: Initialize PPS timestamps on vlv/chv
> 
> Yup. 
> 
> HEAD of my build is at 
> 
> commit ea361c1700ccb5acdc0f5970fd4d64c94c6aa6ee
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Nov 12 10:28:50 2014 +0100
> 
>     drm-intel-nightly: 2014y-11m-12d-09h-28m-33s UTC integration manifest

The log is from an older kernel since it doesn't have the port information in the vdd debug messages. But I guess you're sayng it still happens with the more recent kernels too.

Apart from the massive delay there's one other WTF in the logs:
[    3.840671] [drm:intel_dp_init_panel_power_sequencer_registers] panel power sequencer register settings: PP_ON 0x87d00001, PP_OFF 0x1f40001, PP_DIV 0x270f06
[    5.068936] [drm:intel_dp_init_panel_power_sequencer_registers] panel power sequencer register settings: PP_ON 0x80000001, PP_OFF 0x1, PP_DIV 0x270f00

I'm left wondering what happened to the delays we had in those registers. Somehow they got zeroed after we already computed them to !=0 values. Althoguh that issue was already present in the "good" log.

Also can you make edp_panel_vdd_on() dump a backtrace so we can see where it's coming from. I guess it's some AUX stuff (EDID read perhaps). I can't see why it would suddenly take forever. Especially since it doesn't appear to time out or anything.

I don't see any "disabling display" power well debug messages in the log, so at least resetting pps_pipe doesn't seem to happen anywhere. The other part of the patch adds the power_domain get/put around the pps_mutex, but since we have init power enabled here, these should be mostly nops. It does mean we grab and release the power domain mutex a few extra times as well, so there's some overhead, but I would think there would have to some serious contention on the lock to get this kind of effect.

Sadly I can't reproduce this here.

Comment 6 Ville Syrjala 2014-11-18 13:08:04 UTC

Created attachment 109670 [details] [review]
[PATCH] drm/i915: Try to avoid pps_{lock,unlock}() on DP ports

Attempt at reducing the locking overhead for regular DP AUX transfers.

In case the extra locking we now do is really the culprit, can you test this one?

Comment 7 Egbert Eich 2014-11-19 10:46:19 UTC

*** Bug 86202 has been marked as a duplicate of this bug. ***

Comment 8 Egbert Eich 2014-11-20 09:45:13 UTC

(In reply to Ville Syrjala from comment #5)
 
> The log is from an older kernel since it doesn't have the port information
> in the vdd debug messages. But I guess you're sayng it still happens with
> the more recent kernels too.

The dump was from the very version that introduced this regression but I'm seeingthe same behavior on recent intel nightly kernels (as of last week). 
I will generate more dumps on more recent kernels.

> 
> Apart from the massive delay there's one other WTF in the logs:
> [    3.840671] [drm:intel_dp_init_panel_power_sequencer_registers] panel
> power sequencer register settings: PP_ON 0x87d00001, PP_OFF 0x1f40001,
> PP_DIV 0x270f06
> [    5.068936] [drm:intel_dp_init_panel_power_sequencer_registers] panel
> power sequencer register settings: PP_ON 0x80000001, PP_OFF 0x1, PP_DIV
> 0x270f00
> 
> I'm left wondering what happened to the delays we had in those registers.
> Somehow they got zeroed after we already computed them to !=0 values.
> Althoguh that issue was already present in the "good" log.

Right. I did notice those delays as well hadn't paid much attention to them, yet so I didn't check the logs for them explicitly.

As it looks like a v3.13 kernel (with a regression fix for DP on top) is the most stable version on this hardware. I've seen other regressions introduced before the mentioned patch (for instance even without this patch the fb console is not entirely stable) but I haven't bisected these, yet.

> Also can you make edp_panel_vdd_on() dump a backtrace so we can see where
> it's coming from. I guess it's some AUX stuff (EDID read perhaps). I can't
> see why it would suddenly take forever. Especially since it doesn't appear
> to time out or anything.

Will do!
 
> I don't see any "disabling display" power well debug messages in the log, so
> at least resetting pps_pipe doesn't seem to happen anywhere. The other part
> of the patch adds the power_domain get/put around the pps_mutex, but since
> we have init power enabled here, these should be mostly nops. It does mean
> we grab and release the power domain mutex a few extra times as well, so
> there's some overhead, but I would think there would have to some serious
> contention on the lock to get this kind of effect.
> 
> Sadly I can't reproduce this here.

I will play with this some more and also check more recent kernels.

(In reply to Ville Syrjala from comment #6)
> Created attachment 109670 [details] [review] [review]
> [PATCH] drm/i915: Try to avoid pps_{lock,unlock}() on DP ports
> 
> Attempt at reducing the locking overhead for regular DP AUX transfers.
> 
> In case the extra locking we now do is really the culprit, can you test this
> one?

Sure!

Comment 9 Egbert Eich 2014-11-21 09:29:12 UTC

This issue is driving me nuts.

I've built drm-intel-nightly yesterday but unable to reproduce the issue - in fact everything got initialized flawlessly. The patch in comment #6 didn't play a role as
a. The long delay between calls to 
   intel_dp_init_panel_power_sequencer_registers() 
   did not exists.edp_panel_vdd_off_sync
b. edp_panel_vdd_off_sync() was called a lot less often.

In fact the only evidence I had that I was not dreaming last week when I openend this ticket was the log in https://bugs.freedesktop.org/attachment.cgi?id=109342
which clearly indicated that those issues existed in a kernel based on 3.18-rc4.

Comparing this log with the ones I generated yesterday revealed that the drm modules were loaded a lot later during the boot process.
I've now been able to rebuild a kernel which exhibits this issue. I will now go on comparing the configurations of the two builds.

So far it seems that for the 'bad' kernel build the patch in comment #6 at least fixed the problem in the EDID read out which can be seen in
https://bugs.freedesktop.org/attachment.cgi?id=109342
as without it I can easily reproduce the bogus EDIDs.

So for now, let me figure out what is so different in the two builds of exactly the same kernel version.

Comment 10 Egbert Eich 2014-11-24 16:27:51 UTC

Created attachment 109950 [details] [review]
Fix: When enabling panel VDD cancel pending disable  worker

I found the time to debug this.

There are two issues:

1. In edp_panel_vdd_on() we don't cancel a vdd_off worker which may be pending.
   This may lead to the worker firing right in the middle of an EDID transfer.
   While this is most likely not what is expected it didn't cause the EDID to be 
   misread - at least for my panel.
   Attached os a patch that attempts to fix this.

2. The long delay happens when the HZ is set to 250. I don't see it when it is
   set to 1000.
   Here is what's happening: on eDP pps_lock()/unlock() gets called on each low 
   level transfer. This function can be quite expensive however: when checking 
   if the power well is enabled vlv_sideband_rw() gets called this performs its 
   tests in a wait_for() loop. Since these tests don't seem to succeed the 
   first time around the test is redone after an 1 ms wait. With HZ set to 250
   the granularity is higher 1 ms however and those waits add up.
   This can be seen easily by replacing wait_for() by wait_for_atomic() which
   retests after a 0 wait. In this case the long delay period disappears even 
   for HZ=250.
   It is however waistful to perfrom those operations at the lowest level at 
   intel_dp_aux_ch(). 
   It would be better to wrap the entire i2c transfer by single calls to these 
   functions. The current design of drm_dp_helper.c doesn't allow this however.
   I'm going to propose a patchset for this to the ML.

Comment 11 Jesse Barnes 2014-12-10 21:43:40 UTC

The wait_for debugging seems to point at a potential solution.  I think Ville is debugging this along with some internal bugs, so I'll assign to him.

Comment 12 Egbert Eich 2014-12-11 07:28:48 UTC

(In reply to Jesse Barnes from comment #11)
> The wait_for debugging seems to point at a potential solution.  I think
> Ville is debugging this along with some internal bugs, so I'll assign to him.

-ENOCONTXT
I may be missing some discussion here.

The idea to avoid the numerous calls to pps_lock()/unlock() during an EDID transfer by locking on the outer instead of the inner layer to me still seems to have some merit.
Daniel and Ville were afraid that the slightly simplistic solution I proposed won't scale when new DPCP transfers are added. 
Thus it was proposed to add the pps_lock()/unlock() operations to the outer layer where appropriate and extend the locking/unlocking itself with some ref counter to avoid deadlocking by calling these operations on the outer and inner layer.

Not sure if this idea is void now or still worth pursuing - Ville?

Comment 13 Ville Syrjala 2014-12-17 15:32:26 UTC

Created attachment 110958 [details] [review]
[PATCH] drm/i915: Kill check_power_well() calls

Thanks to Jani we may now have a real idea why it gets so slow with the get/put. This patch should hopefully fix it. Please test.

Skipping the pps_lock/unlock for DP ports might still be worthwile though.

Comment 14 Jani Nikula 2014-12-18 10:10:10 UTC

Presumed fixed by

commit 7f1241ed1a06b4846ad7a2a57eb088b757e58e16
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Dec 18 11:44:06 2014 +0200

    drm/i915: Kill check_power_well() calls

in drm-intel-next-fixes (and -nightly). Thanks for the report, please do reopen if the problem persists with this.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.