Bug 108462 - two external screens permanently go blank on HP EliteBook Folio G1
Summary: two external screens permanently go blank on HP EliteBook Folio G1
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Karthik B S
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-10-16 19:49 UTC by Johannes Berg
Modified: 2018-12-04 10:23 UTC

See Also:
i915 platform: SKL
i915 features: display/watermark


Attachments
vbios dump (64.00 KB, application/octet-stream)
2018-10-16 19:50 UTC, Johannes Berg
dmesg (7.74 MB, text/x-log)
2018-10-16 19:51 UTC, Johannes Berg
register dump (17.38 KB, text/plain)
2018-10-16 19:52 UTC, Johannes Berg
video showing the issue (10.91 MB, video/mp4)
2018-10-16 19:56 UTC, Johannes Berg
dmesg covering just one or two instances of the problem (165.07 KB, text/plain)
2018-10-16 19:57 UTC, Johannes Berg
config without this problem (130.91 KB, text/plain)
2018-10-27 20:54 UTC, Johannes Berg
config that doesn't boot (136.01 KB, text/plain)
2018-10-27 21:04 UTC, Johannes Berg
trace-cmd recording of i915_reg_rw (1.10 MB, application/x-xz)
2018-11-21 07:32 UTC, Johannes Berg
trace.dat parsed (7.13 MB, text/plain)
2018-12-04 10:08 UTC, Johannes Berg

Description Johannes Berg 2018-10-16 19:49:51 UTC
DRM tip tree, commit 90b59df999a1 ("drm-tip: 2018y-10m-15d-20h-57m-27s UTC integration manifest")

Fedora 28 system on an HP EliteBook Folio G1 with Intel(R) Core(TM) m7-6Y75 CPU, x86_64 of course.

I have two external monitors connected:
 * one directly by USB-C DisplayPort cable
 * one via docking station (https://www.iogear.com/product/GUD3C01/), connected with HDMI

Both screens frequently go completely blank. Of course now while I was waiting for it with the camera turned on it didn't happen, but basically it just all goes completely black and the displays turn off the backlight temporarily.

This is a regression, but I cannot exactly say when it was introduced. I know it works on 4.13 (which I installed because of this bug...), and I believe it still worked on 4.15, but I don't have that installed now to test.
Comment 1 Johannes Berg 2018-10-16 19:50:11 UTC
Created attachment 142053 [details]
vbios dump
Comment 2 Johannes Berg 2018-10-16 19:51:19 UTC
Created attachment 142054 [details]
dmesg

Note that these attachments are the same as for bug 108460, both problems occurred there, just at different times.
Comment 3 Johannes Berg 2018-10-16 19:52:40 UTC
Created attachment 142055 [details]
register dump
Comment 4 Johannes Berg 2018-10-16 19:56:09 UTC
Created attachment 142056 [details]
video showing the issue
Comment 5 Johannes Berg 2018-10-16 19:56:43 UTC
FWIW, occasionally it only happens to one screen ...
Comment 6 Johannes Berg 2018-10-16 19:57:13 UTC
Created attachment 142057 [details]
dmesg covering just one or two instances of the problem
Comment 7 Lakshmi 2018-10-17 12:13:44 UTC
(In reply to Johannes Berg from comment #0)
> DRM tip tree, commit 90b59df999a1 ("drm-tip: 2018y-10m-15d-20h-57m-27s UTC
> integration manifest")
> 
> Fedora 28 system on an HP EliteBook Folio G1 with Intel(R) Core(TM) m7-6Y75
> CPU, x86_64 of course.
> 
> I have two external monitors connected:
>  * one directly by USB-C display port cable
>  * one via docking station (https://www.iogear.com/product/GUD3C01/),
> connected with HDMI
> 
> Both screens frequently go completely blank. Of course now while I was
> waiting for it with the camera turned on it didn't happen, but basically it
> just all goes completely black and the displays turn off the backlight
> temporarily.

How does the screen come back? Does any particular action make the screen turn on again?

> This is a regression, but I cannot exactly say when it was introduced. I
> know it works on 4.13 (which I installed because of this bug...), and I
> believe it still worked on 4.15, but I don't have that installed now to test.
Comment 8 Johannes Berg 2018-10-17 12:16:21 UTC
(In reply to Lakshmi from comment #7)

> > Both screens frequently go completely blank. Of course now while I was
> > waiting for it with the camera turned on it didn't happen, but basically it
> > just all goes completely black and the displays turn off the backlight
> > temporarily.
> 
> How the screen comes back? Any particular actions will make the screen turn
> on again?

Oh, they just come back automatically and pretty much immediately, but it's super annoying to work with a system that just decides to turn your screen off and on occasionally :-)
Comment 9 Imre Deak 2018-10-17 16:32:35 UTC
The log at
https://bugs.freedesktop.org/attachment.cgi?id=142054
has pipe underruns on all 3 pipes, so I suspect some watermark problem.

The log at
https://bugs.freedesktop.org/attachment.cgi?id=142057
doesn't have any obvious issues, but that could just be due to underrun reporting being disabled at that time.

Any chance that you could do a bisect?
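
To compare logs like the two above, it may help to count the underrun reports per pipe. A minimal sketch, assuming the reports contain the i915 message text "CPU pipe X FIFO underrun"; count_underruns is a made-up helper name, and the log path is whatever file the dmesg was saved to:

```shell
# Hypothetical helper: count FIFO underrun reports per pipe in a saved dmesg log.
# Assumes each report contains "CPU pipe <letter> FIFO underrun".
count_underruns() {
    grep -o 'CPU pipe [A-Z] FIFO underrun' "$1" |
        awk '{ n[$3]++ } END { for (p in n) print "pipe " p ": " n[p] }' |
        sort
}
```

Running count_underruns on a saved log prints one line per pipe that reported at least one underrun. A log with no lines at all may only mean that underrun reporting was disabled at that point, as noted above.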
Comment 10 Johannes Berg 2018-10-17 19:07:26 UTC
(In reply to Imre Deak from comment #9)
> The log at
> https://bugs.freedesktop.org/attachment.cgi?id=142054
> has pipe underruns on all 3 pipes, so I suspect some watermark problem.
> 
> The log at
> https://bugs.freedesktop.org/attachment.cgi?id=142057
> doesn't have any obvious issues, 

I'm pretty sure that the second log had the issue at least once.

> but that could just be due to underrun
> reporting being disabled at that time.

but I suppose that's possible.

> Any chance that you could do a bisect?

Technically yes, since I know it was fine around 4.15 time-frame, but it'll take ... forever, especially on this machine. Any other ideas would be nicer... :-)
Comment 11 Johannes Berg 2018-10-17 19:11:55 UTC
> Technically yes, since I know it was fine around 4.15 time-frame, but it'll
> take ... forever, especially on this machine. Any other ideas would be
> nicer... :-)

That said, any idea which paths I can restrict the bisect to? Maybe I'll try to run it at some point.
Comment 12 Johannes Berg 2018-10-18 09:56:22 UTC
Ok... I started to bisect, but instead of building with the Fedora config I used "make localmodconfig". I'm on 4.17-rc5 now and the issue isn't happening, though I was reasonably sure that it would happen here. I'm compiling 4.19-rc again with my current config to see if it's just the config ... or if it reproduces there.

Any ideas how the config might affect it?

Like I said, I'm not 100% certain it previously occurred on 4.17 with Fedora config, but I thought it did.
Comment 13 Johannes Berg 2018-10-18 10:44:49 UTC
Ok, hmmm. This does seem to depend on the kernel .config. With the current config ("make localmodconfig") on the same DRM tip tree (commit 90b59df999a1), it hasn't happened in the few minutes it has been running, which would be almost impossible with the broken kernel...

One (perhaps significant) difference that I notice is that this kernel now shows the 4 boot-time penguins, which is not the case with Fedora's config.

Any thoughts as to what Kconfig knobs might affect this that I can play with?

I can't really bisect if it's a Kconfig issue, and only happens on Fedora's config - that's too big to bisect with. If I can reproduce with a smaller config (and then not reproduce on older kernels) I can attempt the bisect again.
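
One way to narrow down which Kconfig options differ between a working and a failing config is to diff only the options that are actually set. A rough sketch using standard tools; config_diff is a made-up name, and both arguments are paths to .config files:

```shell
# Hypothetical helper: list options that are set differently in two .config files.
# Only "CONFIG_FOO=..." lines are compared; "# CONFIG_FOO is not set" lines are
# ignored, so an option that is unset in one file shows up on one side only.
config_diff() {
    cfg_a=$(mktemp)
    cfg_b=$(mktemp)
    grep '^CONFIG_' "$1" | sort > "$cfg_a"
    grep '^CONFIG_' "$2" | sort > "$cfg_b"
    # "<" marks lines only in the first file, ">" lines only in the second.
    diff "$cfg_a" "$cfg_b" | grep '^[<>]'
    rm -f "$cfg_a" "$cfg_b"
}
```

The kernel tree also ships scripts/diffconfig for the same purpose. Either way, the resulting list can then be halved by hand, which is slow but far smaller than bisecting the whole Fedora config.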
Comment 14 Imre Deak 2018-10-18 15:34:05 UTC
Not sure what Kconfig option would affect this issue.

As another approach to narrow down the problem, could you try - right after triggering the problem - disabling the low-power FIFO mode and see if the problem is still reproducible? Please also provide the output of the script:

# cd /sys/kernel/debug/dri/0
# for plane in pri cur spr; do
>       cat i915_${plane}_wm_latency
>       wm0=$(head -1 i915_${plane}_wm_latency|cut -d' ' -f2)
>       echo $wm0 1000 1000 1000 1000 > i915_${plane}_wm_latency
> done
Comment 15 Lakshmi 2018-10-24 10:17:41 UTC
Johannes, have you tried Imre's suggestion?
Comment 16 Johannes Berg 2018-10-24 10:19:15 UTC
(In reply to Lakshmi from comment #15)
> Johannes, Have you tried Imre's suggestion?

Not yet, unfortunately - I'd still been trying to figure out why my new kernel .config doesn't exhibit the issue, but yeah, I should do that. Perhaps tonight, when I'm off work.
Comment 17 Johannes Berg 2018-10-27 19:13:33 UTC
FWIW, I don't actually know of a way of *triggering* this. It seems to just happen all by itself, sometimes with lots of screen activity, sometimes with none at all.

I noticed that this message *sometimes* seems to coincide with the issue:

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun

but I suppose that's just a symptom of the issue, rather than the cause, since it doesn't *always* happen.


Your script doesn't actually work:

jberg1-mobl2:/sys/kernel/debug/dri/0# for plane in pri cur spr; do
> cat i915_${plane}_wm_latency
> wm0=$(head -1 i915_${plane}_wm_latency|cut -d' ' -f2)
> echo $wm0 1000 1000 1000 1000 > i915_${plane}_wm_latency
> done
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument
Comment 18 Imre Deak 2018-10-27 20:48:38 UTC
(In reply to Johannes Berg from comment #17)
> FWIW, I don't actually know of a way of *triggering* this. It seems to just
> happen all by itself, sometimes with lots of screen activity, sometimes with
> none at all.
> 
> I noticed that this message *sometimes* seems to coincide with the issue:
> 
> [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO
> underrun
> 
> but I suppose that's just be a symptom of the issue, rather than the cause,
> since it doesn't *always* happen.

It could be a marker that we program wrong watermark levels (which I'd like to test with my script).

> Your script doesn't actually work:
> 
> jberg1-mobl2:/sys/kernel/debug/dri/0# for plane in pri cur spr; do
> > cat i915_${plane}_wm_latency
> > echo $wm 1000 1000 1000 1000 > i915_${plane}_wm_latency
> > done

Ah sorry, you have SKL and so 8 watermark levels not 5. The following should work better, could you try it? Forgot to say, but after running the script you also have to force a display modeset, for example by making both displays blank then unblank. Then just leave it running and see if any FIFO underrun message shows up or if the displays flicker. Thanks.

# for plane in pri cur spr; do
> echo 20 500 500 500 500 500 500 500 > i915_${plane}_wm_latency
> done
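
The level count that tripped up the first script can also be read from the debugfs file itself instead of being hard-coded. A minimal sketch, assuming each level shows up as one line starting with "WM" (as in the output in comment 17) and that the file accepts one space-separated value per level; raise_wm_latency and its arguments are made-up names, not part of i915:

```shell
# Hypothetical helper: write <wm0> for level 0 and <rest> for every other level
# to the pri/cur/spr latency files found in <dir>.
# The number of levels is taken from the "WMn ..." lines the file prints.
raise_wm_latency() {
    dir=$1; wm0=$2; rest=$3
    for plane in pri cur spr; do
        f="$dir/i915_${plane}_wm_latency"
        [ -r "$f" ] || return 1
        levels=$(grep -c '^WM' "$f")
        [ "$levels" -gt 0 ] || return 1
        values=$wm0
        i=1
        while [ "$i" -lt "$levels" ]; do
            values="$values $rest"
            i=$((i + 1))
        done
        echo "$values" > "$f" || return 1
    done
}
```

For example, running "raise_wm_latency /sys/kernel/debug/dri/0 20 500" as root should write the same values as the loop above on an 8-level platform. A modeset is still needed afterwards, as described above.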
Comment 19 Johannes Berg 2018-10-27 20:54:43 UTC
Created attachment 142237 [details]
config without this problem
Comment 20 Johannes Berg 2018-10-27 21:04:17 UTC
Created attachment 142238 [details]
config that doesn't boot

So I've been trying to figure out why one .config works, and another doesn't.

In the process, I've arrived at the previously attached .config (config-working.txt) that doesn't exhibit the issue, but I don't have working sound. Note that this config also doesn't exhibit bug 108462.

Now, since I was trying to _also_ make sound work (since I'm actually fairly happy to have a working .config, except it didn't have sound which is annoying), I've slowly been enabling sound options. My previously attached config-working.txt had some sound options enabled, but still no sound.

Now ... with config-notbooting.txt that I just attached, I've really only changed sound configuration:

+CONFIG_REGMAP_I2C=m
+CONFIG_SND_HDA_EXT_CORE=m
+CONFIG_SND_SOC=m
+CONFIG_SND_SOC_TOPOLOGY=y
+CONFIG_SND_SOC_ACPI=m
+CONFIG_SND_SOC_INTEL_SST_TOPLEVEL=y
+CONFIG_SND_SOC_INTEL_SST=m
+CONFIG_SND_SOC_INTEL_SKYLAKE=m
+CONFIG_SND_SOC_ACPI_INTEL_MATCH=m
+CONFIG_SND_SOC_INTEL_MACH=y
+CONFIG_SND_SOC_I2C_AND_SPI=m

REGMAP_I2C got pulled in extra, apparently.

Here's the weird part: this .config doesn't even boot properly. The external screens are not initialized at all, and even the internal panel doesn't get the right resolution in the fedora boot/disk password screen!! It also doesn't boot fully, I get to enter my password but it doesn't get to the display manager.

Looks like I have a choice between
 * kernel with working graphics and no sound
 * kernel with flickering graphics but sound
 * kernel 4.13 (or perhaps 4.15?)

Can anyone explain to me why selecting the sound options should have any impact on the graphics? Clearly it does though ... I can file a separate bug on that though, if you prefer.
Comment 21 Johannes Berg 2018-10-27 21:05:05 UTC
(In reply to Johannes Berg from comment #20)

> In the process, I've arrived at the previously attached .config
> (config-working.txt) that doesn't exhibit the issue, but I don't have
> working sound. Note that this config also doesn't exhibit bug 108462.

Sorry, I meant bug 108460.

johannes
Comment 22 Johannes Berg 2018-10-27 21:26:43 UTC
> Ah sorry, you have SKL and so 8 watermark levels not 5. The following should
> work better, could you try it? Forgot to say, but after running the script
> you also have to force a display modeset, for example by making both
> displays blank then unblank. Then just leave it running and see if any FIFO
> underrun message shows up or if the displays flicker. Thanks.

It looks like that did indeed help. It's been running for a few minutes without showing the error, and that would've been highly unlikely with the situation before. Still on the same DRM commit mentioned in comment #0, fwiw.

Tell me how _this_ is related to kernel .config though?
Comment 23 Lakshmi 2018-11-02 08:37:29 UTC
Imre, any comments here?
Comment 24 Imre Deak 2018-11-02 13:41:33 UTC
(In reply to Lakshmi from comment #23)
> Imre, any comments here?

I think we should check for missing SKL workarounds related to watermark programming. Ville has said that we are missing a few of those.
Comment 25 Lakshmi 2018-11-14 11:05:26 UTC
Ville, have any changes been pushed to drm-tip that help with this issue?
Comment 26 Karthik B S 2018-11-15 10:43:30 UTC
Hi Johannes,

I tried to reproduce the issue at my end on Ubuntu 16.04 using the drm-tip (4.20-rc1) kernel, with 2 displays (eDP + HDMI) connected.
I also set the audio parameters in the config file as mentioned in the bug, but I'm unable to reproduce the issue.
Could you please provide an ftrace with register tracing enabled?
(echo 1 > /sys/kernel/debug/tracing/events/i915/i915_reg_rw/enable)
Comment 27 russianneuromancer 2018-11-16 10:28:01 UTC
> Ah sorry, you have SKL and so 8 watermark levels not 5. The following should work better, could you try it?

Imre, your workaround script from Comment 18 helps with bug 103229 (internal screen flicker on same laptop).

Karthik and Imre, if possible, could you please look into bug 103229?
Comment 28 Johannes Berg 2018-11-21 07:32:26 UTC
Created attachment 142533 [details]
trace-cmd recording of i915_reg_rw

Sorry for the delay, Karthik, here's the trace you requested. I think. Only one of the screens went blank towards the end of the file.

If you have something else in mind, I'd appreciate a full trace-cmd record command line.

FWIW, I'm not surprised you're not able to reproduce this; I myself am having a very hard time reproducing it on a kernel that doesn't use Fedora's configuration.
Comment 29 Karthik B S 2018-12-04 10:03:00 UTC
Hi,

Sorry for the delay in reply.
I actually tried to reproduce the bug at our end multiple times, but have not been successful till now.
Also I'm having some issue with the .dat file format, the file I have seems partially corrupted. A .txt file would suffice.
I've narrowed the ftrace to 4 functions so that the buffer doesn't get overwritten.

Could you please run the steps below?
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo nop > /sys/kernel/debug/tracing/current_tracer
echo "intel_atomic_commit" "intel_atomic_commit_tail" "intel_cpu_fifo_underrun_irq_handler" "gen8_de_irq_handler" > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/events/enable
echo 1 > /sys/kernel/debug/tracing/events/i915/i915_reg_rw/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on

And once you've seen the flicker, dump the trace into a log file.
cat /sys/kernel/debug/tracing/trace > trace.log
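
The same steps can be wrapped in a small function that stops at the first write that fails, which makes a rejected filter or a missing event easier to spot. This is only a sketch of the steps above; setup_i915_trace is a made-up name, and the optional argument is the tracing directory (assumed to be the debugfs path used above when omitted):

```shell
# Hypothetical wrapper around the tracing steps; stops at the first failing write.
setup_i915_trace() {
    t=${1:-/sys/kernel/debug/tracing}
    echo 0 > "$t/tracing_on" &&
    echo nop > "$t/current_tracer" &&
    echo "intel_atomic_commit intel_atomic_commit_tail intel_cpu_fifo_underrun_irq_handler gen8_de_irq_handler" > "$t/set_ftrace_filter" &&
    echo function > "$t/current_tracer" &&
    echo 0 > "$t/events/enable" &&
    echo 1 > "$t/events/i915/i915_reg_rw/enable" &&
    echo 1 > "$t/tracing_on"
}
```

If the function returns non-zero, the shell's "write error" message shows which step was rejected.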
Comment 30 Johannes Berg 2018-12-04 10:07:44 UTC
Ok, I'll try to do that - in the meantime I'm attaching the parsed version of the trace, but it looks like the function stuff didn't get recorded?!
Comment 31 Johannes Berg 2018-12-04 10:08:20 UTC
Created attachment 142709 [details]
trace.dat parsed
Comment 32 Karthik B S 2018-12-04 10:22:49 UTC
(In reply to Johannes Berg from comment #30)
> Ok, I'll try to do that - in the meantime I'm attaching the parsed version
> of the trace, but it looks like the function stuff didn't get recorded?!

Yeah, looks like it didn't get recorded. I believe it would be easier to pinpoint the set of register reads/writes that actually caused the error if we have the function calls recorded as well.
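
Once a text trace with both function entries and register events exists, the register accesses leading up to each underrun can be pulled out mechanically. A rough sketch, assuming only that register events contain the string "i915_reg_rw" and that underrun handler entries contain "intel_cpu_fifo_underrun_irq_handler"; reg_writes_before_underrun is a made-up name, and the second argument (default 20) is how many preceding register events to show:

```shell
# Hypothetical helper: for each underrun handler entry in a text trace, print
# that line followed by the last N i915_reg_rw events seen before it.
reg_writes_before_underrun() {
    awk -v n="${2:-20}" '
        /i915_reg_rw/ { buf[count++ % n] = $0; next }   # ring buffer of recent events
        /intel_cpu_fifo_underrun_irq_handler/ {
            print "== " $0
            start = (count > n) ? count - n : 0
            for (i = start; i < count; i++) print buf[i % n]
        }
    ' "$1"
}
```

For example, "reg_writes_before_underrun trace.log 50" would list up to 50 register events ahead of every underrun entry in trace.log.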
Comment 33 Johannes Berg 2018-12-04 10:23:49 UTC
(In reply to Karthik B S from comment #32)

> Yea, looks like it didn't get recorded. I believe it would be easier to pin
> point the set of reg read/write which actually caused the error if we have
> the function calls recorded as well.

Yep, fair enough. I'll work on it later, too busy with other things now to reboot etc.

