Bug 108462

Summary:

two external screens permanently go blank on HP EliteBook Folio G1

Product:

DRI

Reporter:

Johannes Berg <johannes>

Component:

DRM/Intel

Assignee:

Karthik B S <karthik.b.s>

Status:

RESOLVED MOVED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

major

Priority:

high

CC:

goodmirek, imre.deak, intel-gfx-bugs, nutello, russianneuromancer, shtetldik, thomas, ville.syrjala

Version:

unspecified

Keywords:

regression

Hardware:

Other

OS:

All

Whiteboard:

Triaged, ReadyForDev

i915 platform:

SKL

i915 features:

display/watermark

Attachments:

Description	Flags
vbios dump	none
dmesg	none
register dump	none
video showing the issue	none
dmesg covering just one or two instances of the problem	none
config without this problem	none
config that doesn't boot	none
trace-cmd recording of i915_reg_rw	none
trace.dat parsed	none
requested trace.log	none
continuous trace log	none
continuous trace log - both screens going blank	none
kernel config file with the issue	none
requested dmesg from new drm-tip commit 244c5c8116c0042d61455697a9d85e899e2d9267	none
requested dmesg from new drm-tip commit 244c5c8116c0042d61455697a9d85e899e2d9267 (with drm.debug=0x1e)	none
Debug log with underrun errors	none

Description Johannes Berg 2018-10-16 19:49:51 UTC

DRM tip tree, commit 90b59df999a1 ("drm-tip: 2018y-10m-15d-20h-57m-27s UTC integration manifest")

Fedora 28 system on an HP EliteBook Folio G1 with Intel(R) Core(TM) m7-6Y75 CPU, x86_64 of course.

I have two external monitors connected:
 * one directly by USB-C display port cable
 * one via docking station (https://www.iogear.com/product/GUD3C01/), connected with HDMI

Both screens frequently go completely blank. Of course now while I was waiting for it with the camera turned on it didn't happen, but basically it just all goes completely black and the displays turn off the backlight temporarily.

This is a regression, but I cannot exactly say when it was introduced. I know it works on 4.13 (which I installed because of this bug...), and I believe it still worked on 4.15, but I don't have that installed now to test.

Comment 1 Johannes Berg 2018-10-16 19:50:11 UTC

Created attachment 142053 [details]
vbios dump

Comment 2 Johannes Berg 2018-10-16 19:51:19 UTC

Created attachment 142054 [details]
dmesg

Note that these attachments are the same as for bug 108460, both problems occurred there, just at different times.

Comment 3 Johannes Berg 2018-10-16 19:52:40 UTC

Created attachment 142055 [details]
register dump

Comment 4 Johannes Berg 2018-10-16 19:56:09 UTC

Created attachment 142056 [details]
video showing the issue

Comment 5 Johannes Berg 2018-10-16 19:56:43 UTC

FWIW, occasionally it only happens to one screen ...

Comment 6 Johannes Berg 2018-10-16 19:57:13 UTC

Created attachment 142057 [details]
dmesg covering just one or two instances of the problem

Comment 7 Lakshmi 2018-10-17 12:13:44 UTC

(In reply to Johannes Berg from comment #0)
> DRM tip tree, commit 90b59df999a1 ("drm-tip: 2018y-10m-15d-20h-57m-27s UTC
> integration manifest")
> 
> Fedora 28 system on an HP EliteBook Folio G1 with Intel(R) Core(TM) m7-6Y75
> CPU, x86_64 of course.
> 
> I have two external monitors connected:
>  * one directly by USB-C display port cable
>  * one via docking station (https://www.iogear.com/product/GUD3C01/),
> connected with HDMI
> 
> Both screens frequently go completely blank. Of course now while I was
> waiting for it with the camera turned on it didn't happen, but basically it
> just all goes completely black and the displays turn off the backlight
> temporarily.

How does the screen come back? Does any particular action make the screen turn on again?

> This is a regression, but I cannot exactly say when it was introduced. I
> know it works on 4.13 (which I installed because of this bug...), and I
> believe it still worked on 4.15, but I don't have that installed now to test.

Comment 8 Johannes Berg 2018-10-17 12:16:21 UTC

(In reply to Lakshmi from comment #7)

> > Both screens frequently go completely blank. Of course now while I was
> > waiting for it with the camera turned on it didn't happen, but basically it
> > just all goes completely black and the displays turn off the backlight
> > temporarily.
> 
> How does the screen come back? Does any particular action make the screen
> turn on again?

Oh, they just come back automatically and pretty much immediately, but it's super annoying to work with a system that just decides to turn your screen off and on occasionally :-)

Comment 9 Imre Deak 2018-10-17 16:32:35 UTC

The log at
https://bugs.freedesktop.org/attachment.cgi?id=142054
has pipe underruns on all 3 pipes, so I suspect some watermark problem.

The log at
https://bugs.freedesktop.org/attachment.cgi?id=142057
doesn't have any obvious issues, but that could just be due to underrun reporting being disabled at that time.

Any chance that you could do a bisect?
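
Checking a saved dmesg for underruns per pipe could be sketched roughly as below. The function name count_underruns is made up for illustration; the message text is the one i915 prints in these logs. Since underrun reporting may be disabled after an earlier underrun, a count of zero doesn't prove that none happened.

```shell
# Count "CPU pipe X FIFO underrun" messages per pipe.
# Reads a dmesg log on stdin, prints "<pipe> <count>" lines sorted by pipe.
# count_underruns is a hypothetical helper name, not an existing tool.
count_underruns() {
    awk 'match($0, /CPU pipe [A-Z] FIFO underrun/) {
             # "CPU pipe " is 9 characters, so the pipe letter follows it
             pipe = substr($0, RSTART + 9, 1)
             n[pipe]++
         }
         END { for (p in n) print p, n[p] }' | sort
}
```

Usage would be something like: count_underruns < dmesg.txt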

Comment 10 Johannes Berg 2018-10-17 19:07:26 UTC

(In reply to Imre Deak from comment #9)
> The log at
> https://bugs.freedesktop.org/attachment.cgi?id=142054
> has pipe underruns on all 3 pipes, so I suspect some watermark problem.
> 
> The log at
> https://bugs.freedesktop.org/attachment.cgi?id=142057
> doesn't have any obvious issues, 

I'm pretty sure that the second log had the issue at least once.

> but that could just be due to underrun
> reporting being disabled at that time.

but I suppose that's possible.

> Any chance that you could do a bisect?

Technically yes, since I know it was fine around 4.15 time-frame, but it'll take ... forever, especially on this machine. Any other ideas would be nicer... :-)

Comment 11 Johannes Berg 2018-10-17 19:11:55 UTC

> Technically yes, since I know it was fine around 4.15 time-frame, but it'll
> take ... forever, especially on this machine. Any other ideas would be
> nicer... :-)

That said, any idea which paths I can restrict the bisect to? Maybe I'll try to run it at some point.

Comment 12 Johannes Berg 2018-10-18 09:56:22 UTC

Ok... I started to bisect, but instead of compiling the fedora config I used "make localmodconfig". I'm on 4.17-rc5 now and the issue isn't happening, though I was reasonably sure that it would happen here. I'm compiling 4.19-rc again with my current config to see if it's just the config ... or if it reproduces there.

Any ideas how the config might affect it?

Like I said, I'm not 100% certain it previously occurred on 4.17 with Fedora config, but I thought it did.

Comment 13 Johannes Berg 2018-10-18 10:44:49 UTC

Ok, hmmm. This does seem to depend on the kernel .config, now with the current config ("make localmodconfig") on the same DRM tip tree (commit 90b59df999a1) it hasn't happened yet in a few minutes, which would be almost impossible with the broken kernel...

One (perhaps significant) difference that I notice is that this kernel now shows the 4 boot-time penguins, which is not the case on Fedora's config.

Any thoughts as to what Kconfig knobs might affect this that I can play with?

I can't really bisect if it's a Kconfig issue, and only happens on Fedora's config - that's too big to bisect with. If I can reproduce with a smaller config (and then not reproduce on older kernels) I can attempt the bisect again.

Comment 14 Imre Deak 2018-10-18 15:34:05 UTC

Not sure what Kconfig option would affect this issue.

As another approach to narrow down the problem, could you try - right after triggering the problem - disabling the low-power fifo mode and see if the problem is still reproducible? Please also provide the output for the script:

# cd /sys/kernel/debug/dri/0
# for plane in pri cur spr; do
>       cat i915_${plane}_wm_latency
>       wm0=$(head -1 i915_${plane}_wm_latency|cut -d' ' -f2)
>       echo $wm0 1000 1000 1000 1000 > i915_${plane}_wm_latency
> done

Comment 15 Lakshmi 2018-10-24 10:17:41 UTC

Johannes, Have you tried Imre's suggestion?

Comment 16 Johannes Berg 2018-10-24 10:19:15 UTC

(In reply to Lakshmi from comment #15)
> Johannes, Have you tried Imre's suggestion?

Not yet, unfortunately - I'd still been trying to figure out why my new kernel .config doesn't exhibit the issue, but yeah, I should do that. Perhaps tonight, when I'm off work.

Comment 17 Johannes Berg 2018-10-27 19:13:33 UTC

FWIW, I don't actually know of a way of *triggering* this. It seems to just happen all by itself, sometimes with lots of screen activity, sometimes with none at all.

I noticed that this message *sometimes* seems to coincide with the issue:

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun

but I suppose that's just a symptom of the issue, rather than the cause, since it doesn't *always* happen.


Your script doesn't actually work:

jberg1-mobl2:/sys/kernel/debug/dri/0# for plane in pri cur spr; do
> cat i915_${plane}_wm_latency
> wm0=$(head -1 i915_${plane}_wm_latency|cut -d' ' -f2)
> echo $wm0 1000 1000 1000 1000 > i915_${plane}_wm_latency
> done
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
bash: echo: write error: Invalid argument

Comment 18 Imre Deak 2018-10-27 20:48:38 UTC

(In reply to Johannes Berg from comment #17)
> FWIW, I don't actually know of a way of *triggering* this. It seems to just
> happen all by itself, sometimes with lots of screen activity, sometimes with
> none at all.
> 
> I noticed that this message *sometimes* seems to coincide with the issue:
> 
> [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO
> underrun
> 
> but I suppose that's just a symptom of the issue, rather than the cause,
> since it doesn't *always* happen.

It could be a marker that we program wrong watermark levels (which I'd like to test with my script).

> Your script doesn't actually work:
> 
> jberg1-mobl2:/sys/kernel/debug/dri/0# for plane in pri cur spr; do
> > cat i915_${plane}_wm_latency
> > echo $wm 1000 1000 1000 1000 > i915_${plane}_wm_latency
> > done

Ah sorry, you have SKL and so 8 watermark levels not 5. The following should work better, could you try it? Forgot to say, but after running the script you also have to force a display modeset, for example by making both displays blank then unblank. Then just leave it running and see if any FIFO underrun message shows up or if the displays flicker. Thanks.

# for plane in pri cur spr; do
> echo 20 500 500 500 500 500 500 500 > i915_${plane}_wm_latency
> done
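
Since the number of watermark levels differs between platforms (5 vs. 8 here), the value list could also be derived from the current contents of the file instead of being hard-coded. A minimal sketch, assuming the "WMn value (x usec)" dump format shown in comment 17; wm_override_values is a hypothetical helper name:

```shell
# Build a value list for i915_<plane>_wm_latency from its current dump.
# Reads lines such as "WM0 2 (2.0 usec)" on stdin and prints one value per
# level: $1 for the first level (WM0), $2 for every other level.
# wm_override_values is a made-up name for illustration.
wm_override_values() {
    awk -v first="$1" -v rest="$2" '
        /^WM[0-9]+ / { out = out (n++ ? (" " rest) : first) }
        END { print out }'
}

# Possible use (as root, followed by a forced modeset):
#   cd /sys/kernel/debug/dri/0
#   for plane in pri cur spr; do
#       vals=$(wm_override_values 20 500 < i915_${plane}_wm_latency)
#       echo $vals > i915_${plane}_wm_latency
#   done
```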

Comment 19 Johannes Berg 2018-10-27 20:54:43 UTC

Created attachment 142237 [details]
config without this problem

Comment 20 Johannes Berg 2018-10-27 21:04:17 UTC

Created attachment 142238 [details]
config that doesn't boot

So I've been trying to figure out why one .config works, and another doesn't.

In the process, I've arrived at the previously attached .config (config-working.txt) that doesn't exhibit the issue, but I don't have working sound. Note that this config also doesn't exhibit bug 108462.

Now, since I was trying to _also_ make sound work (since I'm actually fairly happy to have a working .config, except it didn't have sound which is annoying), I've slowly been enabling sound options. My previously attached config-working.txt had some sound options enabled, but still no sound.

Now ... with config-notbooting.txt that I just attached, I've really only changed sound configuration:

+CONFIG_REGMAP_I2C=m
+CONFIG_SND_HDA_EXT_CORE=m
+CONFIG_SND_SOC=m
+CONFIG_SND_SOC_TOPOLOGY=y
+CONFIG_SND_SOC_ACPI=m
+CONFIG_SND_SOC_INTEL_SST_TOPLEVEL=y
+CONFIG_SND_SOC_INTEL_SST=m
+CONFIG_SND_SOC_INTEL_SKYLAKE=m
+CONFIG_SND_SOC_ACPI_INTEL_MATCH=m
+CONFIG_SND_SOC_INTEL_MACH=y
+CONFIG_SND_SOC_I2C_AND_SPI=m

REGMAP_I2C got pulled in extra, apparently.
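
A list like the one above can be produced from two .config files with something along these lines. This is only a sketch; config_added is a hypothetical name, and the kernel tree's scripts/diffconfig does a more thorough job. Note that an option whose value changed (e.g. =m to =y) also shows up as added here.

```shell
# Print the CONFIG_ lines present in the new config but not in the old one,
# prefixed with "+". Arguments: $1 = old .config, $2 = new .config.
# config_added is a made-up helper name for illustration.
config_added() {
    # -x: whole-line match, -F: fixed strings, -f: patterns from old config
    grep '^CONFIG_' "$2" | grep -vxFf "$1" | sed 's/^/+/'
}
```

For example: config_added config-working.txt config-notbooting.txt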

Here's the weird part: this .config doesn't even boot properly. The external screens are not initialized at all, and even the internal panel doesn't get the right resolution in the fedora boot/disk password screen!! It also doesn't boot fully, I get to enter my password but it doesn't get to the display manager.

Looks like I have a choice between
 * kernel with working graphics and no sound
 * kernel with flickering graphics but sound
 * kernel 4.13 (or perhaps 4.15?)

Can anyone explain to me why selecting the sound options should have any impact on the graphics? Clearly it does though ... I can file a separate bug on that though, if you prefer.

Comment 21 Johannes Berg 2018-10-27 21:05:05 UTC

(In reply to Johannes Berg from comment #20)

> In the process, I've arrived at the previously attached .config
> (config-working.txt) that doesn't exhibit the issue, but I don't have
> working sound. Note that this config also doesn't exhibit bug 108462.

Sorry, I meant bug 108460.

johannes

Comment 22 Johannes Berg 2018-10-27 21:26:43 UTC

> Ah sorry, you have SKL and so 8 watermark levels not 5. The following should
> work better, could you try it? Forgot to say, but after running the script
> you also have to force a display modeset, for example by making both
> displays blank then unblank. Then just leave it running and see if any FIFO
> underrun message shows up or if the displays flicker. Thanks.

It looks like that did indeed help. It's been running for a few minutes without showing the error, and that would've been highly unlikely with the situation before. Still on the same DRM commit mentioned in comment #0, fwiw.

Tell me how _this_ is related to kernel .config though?

Comment 23 Lakshmi 2018-11-02 08:37:29 UTC

Imre, any comments here?

Comment 24 Imre Deak 2018-11-02 13:41:33 UTC

(In reply to Lakshmi from comment #23)
> Imre, any comments here?

I think we should check for missing SKL workarounds related to watermark programming. Ville has said that we are missing a few of those.

Comment 25 Lakshmi 2018-11-14 11:05:26 UTC

Ville, have any changes been pushed to drm-tip that help with this issue?

Comment 26 Karthik B S 2018-11-15 10:43:30 UTC

Hi Johannes,

I tried to reproduce the issue at my end with Ubuntu 16.04 using the DRM-TIP (4.20_rc1) kernel, with 2 displays (eDP+HDMI) connected.
I also set the audio parameters in the config file as mentioned in the bug, but I'm unable to reproduce the issue.
Could you please provide an ftrace with the register trace enabled?
(echo 1 > /sys/kernel/debug/tracing/events/i915/i915_reg_rw/enable)

Comment 27 russianneuromancer 2018-11-16 10:28:01 UTC

> Ah sorry, you have SKL and so 8 watermark levels not 5. The following should work better, could you try it?

Imre, your workaround script from Comment 18 helps with bug 103229 (internal screen flicker on same laptop).

Karthik and Imre, if possible, could you please look into bug 103229?

Comment 28 Johannes Berg 2018-11-21 07:32:26 UTC

Created attachment 142533 [details]
trace-cmd recording of i915_reg_rw

Sorry for the delay, Karthik, here's the trace you requested. I think. Only one of the screens went blank towards the end of the file.

If you have something else in mind, I'd appreciate a full trace-cmd record command line.

FWIW, I'm not surprised you're not able to reproduce this, I myself am having a very hard time reproducing on a kernel that doesn't use fedora's configuration.

Comment 29 Karthik B S 2018-12-04 10:03:00 UTC

Hi,

Sorry for the delay in reply.
I actually tried to reproduce the bug at our end multiple times, but have not been successful till now.
Also I'm having some issue with the .dat file format, the file I have seems partially corrupted. A .txt file would suffice.
I've narrowed the ftrace to 4 functions so that the buffer doesn't get overwritten.

Could you please run the steps below?
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo nop > /sys/kernel/debug/tracing/current_tracer
echo "intel_atomic_commit" "intel_atomic_commit_tail" "intel_cpu_fifo_underrun_irq_handler" "gen8_de_irq_handler" > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/events/enable
echo 1 > /sys/kernel/debug/tracing/events/i915/i915_reg_rw/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on

And once you've seen the flicker, dump the trace into a log file.
cat /sys/kernel/debug/tracing/trace > trace.log
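
To make the setup repeatable after a reboot, the steps above could be wrapped in a function, roughly as follows. This is a sketch only: setup_underrun_trace is a made-up name, and the tracing directory is passed as an argument that defaults to the path used above.

```shell
# Configure ftrace for the four functions of interest plus the i915_reg_rw
# event. $1 = tracing directory (default: /sys/kernel/debug/tracing).
# setup_underrun_trace is a hypothetical helper name. Needs root.
setup_underrun_trace() {
    local t=${1:-/sys/kernel/debug/tracing}
    [ -d "$t" ] || { echo "no tracing directory at $t" >&2; return 1; }
    echo 0 > "$t/tracing_on" &&
    echo nop > "$t/current_tracer" &&
    echo intel_atomic_commit intel_atomic_commit_tail \
         intel_cpu_fifo_underrun_irq_handler gen8_de_irq_handler \
         > "$t/set_ftrace_filter" &&
    echo function > "$t/current_tracer" &&
    echo 0 > "$t/events/enable" &&
    echo 1 > "$t/events/i915/i915_reg_rw/enable" &&
    echo 1 > "$t/tracing_on"
}
```

Calling setup_underrun_trace with no argument applies the settings to the live tracing directory; the trace can then be dumped as described above once the flicker has been seen.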

Comment 30 Johannes Berg 2018-12-04 10:07:44 UTC

Ok, I'll try to do that - in the meantime I'm attaching the parsed version of the trace, but it looks like the function stuff didn't get recorded?!

Comment 31 Johannes Berg 2018-12-04 10:08:20 UTC

Created attachment 142709 [details]
trace.dat parsed

Comment 32 Karthik B S 2018-12-04 10:22:49 UTC

(In reply to Johannes Berg from comment #30)
> Ok, I'll try to do that - in the meantime I'm attaching the parsed version
> of the trace, but it looks like the function stuff didn't get recorded?!

Yea, looks like it didn't get recorded. I believe it would be easier to pin point the set of reg read/write which actually caused the error if we have the function calls recorded as well.

Comment 33 Johannes Berg 2018-12-04 10:23:49 UTC

(In reply to Karthik B S from comment #32)

> Yea, looks like it didn't get recorded. I believe it would be easier to pin
> point the set of reg read/write which actually caused the error if we have
> the function calls recorded as well.

Yep, fair enough. I'll work on it later, too busy with other things now to reboot etc.

Comment 34 Johannes Berg 2018-12-12 07:27:11 UTC

Created attachment 142784 [details]
requested trace.log

Sorry for the delay, finally here's the requested trace.log.

I can't really understand anything from it though, tbh :-)

Comment 35 Karthik B S 2018-12-21 09:25:08 UTC

Hi,

We have logs only for some 20s, looks like the buffer is getting overwritten. So I'm not getting the function call I wanted. 
Assuming that the error occurred at some point, I'm looking for "intel_cpu_fifo_underrun_irq_handler" and the commit just before this, causing the underrun.
I tried to debug regardless, but it is very difficult to find the commit causing this issue, considering that there are 500-odd commits in the trace.

Can you please try running a script in the background that keeps dumping the trace to a log file (or maybe to several files if there are too many logs),
just to ensure we're not missing any trace? With this we'll try to catch the function call for the FIFO underrun.
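
Such a background dump could look roughly like the sketch below, which streams a trace source into fixed-size chunk files. dump_trace_chunks is a hypothetical name; reading from trace_pipe (which consumes entries as they are read) is an assumption about the most suitable source.

```shell
# Stream a trace source into chunk files of a fixed number of lines, so that
# nothing is lost when the ring buffer wraps.
# $1 = source (e.g. /sys/kernel/debug/tracing/trace_pipe)
# $2 = output directory, $3 = lines per chunk (default 100000)
# dump_trace_chunks is a made-up helper name for illustration.
dump_trace_chunks() {
    local src=$1 outdir=$2 lines=${3:-100000}
    mkdir -p "$outdir" || return 1
    # split names the chunks trace.aa, trace.ab, ... in order
    split -l "$lines" - "$outdir/trace." < "$src"
}
```

Run in the background, for example as dump_trace_chunks /sys/kernel/debug/tracing/trace_pipe /tmp/i915-trace &, and stop it once the flicker has been seen.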

Comment 36 Johannes Berg 2019-01-04 21:38:11 UTC

Created attachment 142977 [details]
continuous trace log

Here's a file captured via trace_pipe. Only one of the screens flickered during this time, I'll see if I can capture one where both do.

Comment 37 Johannes Berg 2019-01-04 21:47:45 UTC

(In reply to Johannes Berg from comment #36)
> Created attachment 142977 [details]
> continuous trace log
> 
> Here's a file captured via trace_pipe. Only one of the screens flickered
> during this time, I'll see if I can capture one where both do.

Note that I *didn't* see a corresponding "FIFO underrun" message in dmesg for this ... so maybe the reason here is something else?

Comment 38 Johannes Berg 2019-01-04 22:17:58 UTC

Created attachment 142980 [details]
continuous trace log - both screens going blank

Comment 39 Johannes Berg 2019-01-05 07:37:25 UTC

Apologies for shifting the goal-posts, but I just realized that these latest traces were actually captured on 4.19.13-200.fc28.x86_64 (Fedora), rather than the DRM tip or the original Fedora kernel. Let me know if you need me to reproduce on a particular kernel version.

Comment 40 Karthik B S 2019-01-17 05:30:49 UTC

I went through both the logs and checked the DDB/watermark register writes. They look fine, although I see the WM registers only for Pipe A.
Somehow it seems like many register reads/writes are missing.
Also I don't see any read/write of the Plane control registers and without that it would not be possible to verify the correctness of DDB allocation. 

I would need to bug you for more logs again and this would be never ending, instead I think it would be better if I'm able to reproduce the issue locally. So we'll give one last try for the same.

Firstly, it would be good if you could reproduce the issue on DRM-TIP. Please also share the config file you used with the kernel, together with the resolutions of both displays.
And I believe the issue is only reproduced with the display configuration you mentioned, right?
    * one directly by USB-C display port cable
    * one via docking station connected with HDMI

Hopefully I'll be able to reproduce the issue locally this time and then root-cause it asap.

Comment 41 Ville Syrjala 2019-01-17 14:56:57 UTC

(In reply to Karthik B S from comment #40)
> I went through both the logs and checked the DDB/watermark register
> writes. They look fine, although I see the WM registers only for Pipe A.
> Somehow it seems like many register reads/writes are missing.
> Also I don't see any read/write of the Plane control registers and without
> that it would not be possible to verify the correctness of DDB allocation. 

We no longer have the tracepoint for most plane register writes. It was (somewhat unintentionally) removed as part of dd584fc0711a ("drm/i915: Use I915_READ_FW for plane updates").

Comment 42 Lakshmi 2019-02-26 10:40:28 UTC

Karthik, any further updates here?

Comment 43 Karthik B S 2019-02-26 10:51:46 UTC

(In reply to Lakshmi from comment #42)
> Karthik, any further updates here?

No, we're not able to reproduce it locally.
As mentioned earlier, it would be helpful if we could get the config file used and the details for reproducing the bug, as the logs aren't helping much.

Comment 44 Lakshmi 2019-02-27 07:22:10 UTC

Johannes, can you please attach the config file which caused the issue as mentioned in the description? Also, can you mention the display resolution that was set when the issue occurred?

Comment 45 Shmerl 2019-03-12 19:28:33 UTC

I have a seemingly related bug, where switching to a tty causes the system to hang for some period of time. Sometimes the tty comes up, and you can see

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

in dmesg.

Configuration:
Dell Latitude 7490
GPU: Intel UHD Graphics 620 (Kabylake GT2).
Kernel: 4.20.0-trunk-amd64 (from Debian experimental).

Displays: 2 external monitors, one connected over USB-C (display port connection), and other daisy chained from the first through DisplayPort cable. I keep laptop screen off, so using only two external monitors (KDE Plasma 5.14.5).

You can find config here:
https://salsa.debian.org/kernel-team/linux/blob/debian/4.20-1_exp1/debian/config/config

Often the freeze is hard, and the only way to unfreeze it is Alt+SysRq+REISUB or complete hard reboot.

Comment 46 Shmerl 2019-03-12 19:34:43 UTC

Some more details for my case above.

External monitors resolution (both): 2560x1440
Laptop screen resolution: 1920x1080 (kept off).

Displays are daisy chained like this: Laptop -> (USB-C) -> Dell U2715H -> (DP) -> Dell U2713HM. 2560x1440

DisplayPort 1.2 mode is enabled on Dell U2715H, while Dell U2713HM doesn't have such setting in on-screen UI.

Comment 47 Shmerl 2019-03-13 03:58:42 UTC

To add to the above, this also happens sometimes when external monitors simply go to sleep normally due to inactivity (after set KDE time period). After that it's not possible to wake them up (or even laptop screen), without rebooting the system.

Comment 48 Lakshmi 2019-03-15 08:58:34 UTC

Johannes, can you please attach the config file which caused the issue as mentioned in the description? Can you address Comment 43?

This information would be very helpful to proceed with the investigation.

Comment 49 Johannes Berg 2019-03-15 09:08:09 UTC

I'm sorry, I've been completely swamped.

I'll attach the config file, as for other details, I think you asked for details on the configuration.

So, I have:
 1) A Dell U2312HM running at 1920x1080 @ 60Hz
 2) A Dell U2311H  running at 1920x1080 @ 60Hz
 3) internal display running at 1920x1080

Both are connected using DP, but one is connected directly to the laptop (USB-C connector) and the other is connected via a dock as mentioned before. I said before it was connected on HDMI, but it is connected on DP now. Didn't change anything.

I've been running with the workaround:

for plane in pri cur spr; do echo 20 500 500 500 500 500 500 500 > i915_${plane}_wm_latency ; done

which makes it not be an issue for me.

Comment 50 Johannes Berg 2019-03-15 09:09:22 UTC

Created attachment 143673 [details]
kernel config file with the issue

This is the config file I use right now. Previously, when I rebuilt drm-tip, I was able to reproduce it with this config, but not with an arbitrary locally generated one.

Comment 51 Shmerl 2019-03-18 15:19:23 UTC

(In reply to Johannes Berg from comment #49)
> I've been running with the workaround:
> 
> for plane in pri cur spr; do echo 20 500 500 500 500 500 500 500 >
> i915_${plane}_wm_latency ; done
> 
> which makes it not be an issue for me.

What is the correct format for setting those values? That's what I see there now:

for plane in pri cur spr; do cat /sys/kernel/debug/dri/0/i915_${plane}_wm_latency; done

WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)
WM0 2 (2.0 usec)
WM1 19 (19.0 usec)
WM2 28 (28.0 usec)
WM3 32 (32.0 usec)
WM4 63 (63.0 usec)
WM5 77 (77.0 usec)
WM6 83 (83.0 usec)
WM7 99 (99.0 usec)

That doesn't match the format in your example above.

Comment 52 Shmerl 2019-03-25 15:52:00 UTC

I tried this workaround:

    for plane in pri cur spr; do echo 20 500 500 500 500 500 500 500 > i915_${plane}_wm_latency ; done

But it's causing problems in my setup (the daisy-chained monitor becomes unstable and flickers).

I had to revert to previous values:

    for plane in pri cur spr; do echo 2 19 28 32 63 77 83 99 > i915_${plane}_wm_latency ; done
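
The file is read back in "WMn value (x usec)" form but written as a plain space-separated list, so saving the defaults before experimenting could be sketched as below. wm_values_from_dump is a hypothetical helper name; the formats are the ones shown in the comments above.

```shell
# Convert a wm_latency dump ("WM0 2 (2.0 usec)" ...) read on stdin into the
# space-separated list that the same file accepts on write.
# wm_values_from_dump is a made-up name for illustration.
wm_values_from_dump() {
    awk '/^WM[0-9]+ / { printf "%s%s", sep, $2; sep = " " }
         END { print "" }'
}

# Possible use (as root):
#   cd /sys/kernel/debug/dri/0
#   saved=$(wm_values_from_dump < i915_pri_wm_latency)   # before changing
#   echo $saved > i915_pri_wm_latency                    # to restore
```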

Comment 53 Karthik B S 2019-04-03 05:30:57 UTC

(In reply to Johannes Berg from comment #49)
> I'm sorry, I've been completely swamped.
> 
> I'll attach the config file, as for other details, I think you asked for
> details on the configuration.
> 
> So, I have:
>  1) A Dell U2312HM running at 1920x1080 @ 60Hz
>  2) A Dell U2311H  running at 1920x1080 @ 60Hz
>  3) internal display running at 1920x1080
> 
> Both are connected using DP, but one is connected directly to the laptop
> (USB-C connector) and the other is connected via a dock as mentioned before.
> I said before it was connected on HDMI, but it is connected on DP now.
> Didn't change anything.
> 
> I've been running with the workaround:
> 
> for plane in pri cur spr; do echo 20 500 500 500 500 500 500 500 >
> i915_${plane}_wm_latency ; done
> 
> which makes it not be an issue for me.

Tried to repro the bug with 2 external DP panels at 2k@60 and one eDP panel at 2k@60, using the config file provided on DRM-TIP(5.0.0-rc5+). Still not seeing any under runs.

I tried video playback on 2 displays for 4-5 hours and also the suspend resume scenario without any success.

Any particular workload or sequence you can suggest, which might cause this issue in particular?

Comment 54 Karthik B S 2019-04-03 05:43:47 UTC

(In reply to Shmerl from comment #45)
> I have a seemingly related bug, when switching to tty causes system to hang
> for some period of time. Sometimes tty comes up, and you can see
> 
> [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO
> underrun
> 
> in dmesg.
> 
> Configuration:
> Dell Latitude 7490
> GPU: Intel UHD Graphics 620 (Kabylake GT2).
> Kernel: 4.20.0-trunk-amd64 (from Debian experimental).
> 
> Displays: 2 external monitors, one connected over USB-C (display port
> connection), and other daisy chained from the first through DisplayPort
> cable. I keep laptop screen off, so using only two external monitors (KDE
> Plasma 5.14.5).
> 
> You can find config here:
> https://salsa.debian.org/kernel-team/linux/blob/debian/4.20-1_exp1/debian/
> config/config
> 
> Often the freeze is hard, and the only way to unfreeze it is
> Alt+SysRq+REISUB or complete hard reboot.

Yet to try this out on Kabylake, will check it with the mentioned config and get back.

Comment 55 Johannes Berg 2019-04-03 19:13:09 UTC

(In reply to Karthik B S from comment #53)

> Tried to repro the bug with 2 external DP panels at 2k@60 and one eDP panel
> at 2k@60, using the config file provided on DRM-TIP(5.0.0-rc5+). Still not
> seeing any under runs.

Hmm. Maybe it's a machine-specific thing? Or maybe it's important that there's the DP hub in between?

> I tried video playback on 2 displays for 4-5 hours and also the suspend
> resume scenario without any success.
> 
> Any particular workload or sequence you can suggest, which might cause this
> issue in particular?

Nope, sorry. For me it usually starts showing up as soon as I log in, sometimes before, sometimes a little later. But it's not even that I need to play video or do something else. I'm on Fedora but still run X (not wayland) with gnome shell, but that's it...

Comment 56 russianneuromancer 2019-05-25 04:59:08 UTC

> Maybe it's a machine-specific thing?

With bug 103229 this indeed looks like a machine-specific thing.

Question regarding bugreport status NEEDINFO - what other information is needed?

Comment 57 Lakshmi 2019-07-13 18:13:51 UTC

@Karthik, any updates on this issue?

Comment 58 Karthik B S 2019-07-23 03:04:56 UTC

(In reply to Lakshmi from comment #57)
> @Karthik, any updates on this issue?

Unfortunately no. The last update is that I'm not able to reproduce it.
After all the futile attempts at reproducing the issue, it looks like this is a machine specific issue.

Comment 59 russianneuromancer 2019-07-23 05:24:12 UTC

> it looks like this is a machine specific issue

What further action is possible in this case? Is it possible to request access to certain hardware from HP? It seems like this was done, for example, in this case: https://bugzilla.kernel.org/show_bug.cgi?id=201579#c8

Comment 60 Johannes Berg 2019-07-26 07:56:35 UTC

(In reply to Karthik B S from comment #58)
> (In reply to Lakshmi from comment #57)
> > @Karthik, any updates on this issue?
> 
> Unfortunately no. The last update is that I'm not able to reproduce it.
> After all the futile attempts at reproducing the issue, it looks like this
> is a machine specific issue.

Let's take this internally then (look me up in the Intel address book), maybe we have more such machines and can provide one, or maybe I can swap machines and send you this one, or something like that.

Or maybe I can put the system on the VPN and let you look at it that way.

Comment 61 Karthik B S 2019-07-26 09:31:12 UTC

(In reply to Johannes Berg from comment #60)
> (In reply to Karthik B S from comment #58)
> > (In reply to Lakshmi from comment #57)
> > > @Karthik, any updates on this issue?
> > 
> > Unfortunately no. The last update is that I'm not able to reproduce it.
> > After all the futile attempts at reproducing the issue, it looks like this
> > is a machine specific issue.
> 
> Let's take this internally then (look me up in the Intel address book),
> maybe we have more such machines and can provide one, or maybe I can swap
> machines and send you this one, or something like that.
> 
> Or maybe I can put the system on the VPN and let you look at it that way.

Sure.

Comment 62 russianneuromancer 2019-07-26 16:03:08 UTC

Karthik, when you get access to hardware, please also look into Bug 111201 - this is another screen flicker issue specific to HP EliteBook Folio G1.

Comment 63 Shmerl 2019-07-26 16:04:39 UTC

If you have a chance, check also Dell Latitude 7490 please, which I mentioned above. It also has the same issue.

Comment 64 Johannes Berg 2019-07-27 23:10:19 UTC

I'm almost done doing a long and tiring config bisect now, using Steven's awesome script tools/testing/ktest/config-bisect.pl ...

But one thing I just noticed, on the drm-tip kernel I got this WARN_ON a lot?

[  289.339478] WARNING: CPU: 2 PID: 2200 at drivers/gpu/drm/i915/intel_pm.c:4395 skl_allocate_pipe_ddb+0xa2f/0xb60 [i915]
[  289.339481] ---[ end trace f63ed9bc71cfc7ae ]---
[  289.339496] ------------[ cut here ]------------
[  289.339498] WARN_ON(wm->wm[level].min_ddb_alloc > total[PLANE_CURSOR])


It's possible, however, that this occurred because I set the watermark levels as per a comment above...

Comment 65 Johannes Berg 2019-07-29 12:06:04 UTC

The warnings seem quite possibly unrelated, but this is a new print I hadn't seen before:

[ 5138.452734] [drm] HPD interrupt storm detected on connector eDP-1: switching from hotplug detection to polling


(but then again, I never let it flicker for this long!)

Comment 66 Johannes Berg 2019-07-29 12:14:09 UTC

Ah, the warnings do happen when I enter the script from comment #18...

Comment 67 Shmerl 2019-08-08 15:35:17 UTC

Just updated to kernel 5.2.6 on my Dell Latitude 7490, also using latest i915 firmware. This hang still happens, as soon as KDE puts monitors to sleep, they never wake up after that, and I need to reboot the computer.

Comment 68 Lakshmi 2019-08-09 11:12:36 UTC

(In reply to Shmerl from comment #67)
> Just updated to kernel 5.2.6 on my Dell Latitude 7490, also using latest
> i915 firmware. This hang still happens, as soon as KDE puts monitors to
> sleep, they never wake up after that, and I need to reboot the computer.

Can you attach the dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M?
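
For reference, drm.debug is a bitmask of DRM debug categories. Below is a minimal sketch of decoding it, assuming the category bit values match the kernel's DRM_UT_* definitions (the table and the helper name decode_drm_debug are illustrative, not taken from this bug):

```python
# Bit values assumed from the kernel's DRM_UT_* debug categories;
# verify them against drm_print.h for the kernel actually in use.
DRM_DEBUG_CATEGORIES = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the categories enabled in mask, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_CATEGORIES.items()) if mask & bit]


print(decode_drm_debug(0x1e))
```

Under those assumptions, 0x1e enables the driver, KMS, PRIME and atomic messages but leaves the core and vblank categories off, which keeps the log volume manageable.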

Comment 69 Shmerl 2019-08-26 20:39:42 UTC

(In reply to Lakshmi from comment #68)
>
> Can you attach the dmesg from boot with kernel parameters drm.debug=0x1e
> log_buf_len=4M?

Just installed kernel 5.3-rc5, and looks like sleep mode for the monitors worked OK at least one time. I'll see if it works consistently now, and will comment. If it hangs again, I'll post dmesg. Thanks!

Comment 70 Shmerl 2019-08-28 15:11:10 UTC

Yep, after using for a while, I can confirm that the issue is gone for me (kernel 5.3-rc5).

Comment 71 Lakshmi 2019-08-29 08:22:56 UTC

(In reply to Shmerl from comment #70)
> Yep, after using for a while, I can confirm that the issue is gone for me
> (kernel 5.3-rc5).

Johannes, can you please confirm if this issue is still happening to you? If not, can I close this bug?

Comment 72 Johannes Berg 2019-08-29 13:44:51 UTC

(In reply to Lakshmi from comment #71)

> Johannes, can you please confirm if this issue is still happening to you? If
> not, can I close this bug?

I rebuilt drm-tip as of an hour ago or so (commit 244c5c8116c0) and the issue definitely *is* still happening.

Comment 73 Johannes Berg 2019-08-29 13:51:42 UTC

Actually, it got MUCH worse.

Not only do screens (all, including the internal) keep flickering/turning on and off with that kernel like before, but it also loses one of the external screens entirely.

What happens is that with this new kernel, going into gdm3 turns off both external displays. This does NOT happen with gdm3 with a different kernel, including rc1, there they are turned on (and keep flickering) in gdm3. When I log in, only one external display is turned on again.

Note that a similar behaviour happened before - if I were warm-rebooting the machine, only one external display would turn on after the reboot. A cold reboot turned on both external screens reliably. So the "can only turn on one of the two after two had been turned off" was already there, but what's new is that without any changes other than the kernel, gdm3 now turns off both external screens on the login screen.

Much worse, because now with this kernel I can only use one external screen at all (except during boot, well...)

Comment 74 Shmerl 2019-08-29 14:00:56 UTC

Just for the reference, I'm using KDE Plasma / sddm.

Comment 75 Lakshmi 2019-08-30 08:20:52 UTC

(In reply to Johannes Berg from comment #73)
> Actually, it got MUCH worse.
> 
> Not only do screens (all, including the internal) keep flickering/turning on
> and off with that kernel like before, but it also loses one of the external
> screens entirely.
> 
> What happens is that with this new kernel, going into gdm3 turns off both
> external displays. This does NOT happen with gdm3 with a different kernel,
> including rc1, there they are turned on (and keep flickering) in gdm3. When
> I log in, only one external display is turned on again.
> 
> Note that a similar behaviour happened before - if I were warm-rebooting the
> machine, only one external display would turn on after the reboot. A cold
> reboot turned on both external screens reliably. So the "can only turn on
> one of the two after two had been turned off" was already there, but what's
> new is that without any changes other than the kernel, gdm3 now turns off
> both external screens on the login screen.
> 
> Much worse, because now with this kernel I can only use one external screen
> at all (except during boot, well...)

Can you please attach the logs as you verified the issue on drmtip?

Comment 76 Johannes Berg 2019-08-30 08:23:29 UTC

(In reply to Lakshmi from comment #75)

> Can you please attach the logs as you verified the issue on drmtip?

Sure, but can you remind me which logs you'd want in this case, and how to capture them? Like in comments 26ff?

Comment 77 Johannes Berg 2019-08-30 08:24:25 UTC

Oh, and also, unless you can tell me how to capture this from boot, I'm not sure I could capture the gdm3 issue since it boots into that directly?

Comment 78 Lakshmi 2019-08-30 10:08:23 UTC

(In reply to Johannes Berg from comment #76)
> (In reply to Lakshmi from comment #75)
> 
> > Can you please attach the logs as you verified the issue on drmtip?
> 
> Sure, but can you remind me which logs you'd want in this case, and how to
> capture them? Like in comments 26ff?

For now, the dmesg from boot is required. You can find it under /var/log.

Comment 79 Johannes Berg 2019-09-03 06:23:38 UTC

Created attachment 145242 [details]
requested dmesg from new drm-tip commit 244c5c8116c0042d61455697a9d85e899e2d9267

This is from drm-tip 244c5c8116c0042d61455697a9d85e899e2d9267 that I compiled the other day, sorry for the delay.

Nothing really stands out in the log though.

Remember this is now with the second external display turned off, as gdm3 disabled both external displays on this kernel and only one came back after logging in - nothing really is shown though in the log pertaining to this.

Comment 80 Lakshmi 2019-09-03 07:21:03 UTC

(In reply to Johannes Berg from comment #79)
> Created attachment 145242 [details]
> requested dmesg from new drm-tip commit
> 244c5c8116c0042d61455697a9d85e899e2d9267
> 
> This is from drm-tip 244c5c8116c0042d61455697a9d85e899e2d9267 that I
> compiled the other day, sorry for the delay.
> 
> Nothing really stands out in the log though.
> 
> Remember this is now with the second external display turned off, as gdm3
> disabled both external displays on this kernel and only one came back after
> logging in - nothing really is shown though in the log pertaining to this.

Thanks. I can see underruns from the logs. *ERROR* CPU pipe B FIFO underrun.
Can you please attach the same logs with kernel parameters drm.debug=0x1e log_buf_len=4M? This will show more information about the issue.

Comment 81 Shmerl 2019-09-03 16:18:20 UTC

Actually, while it's better, the issue is not totally gone in my case. The monitors now wake up most of the time, but one time only one monitor woke up and the other did not. I also see this in dmesg (relatively recent messages):

[155348.188820] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
[169094.717384] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[331635.239381] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

I'll post detailed dmesg from boot with drm.debug=0x1e a bit later.
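
To compare how often each pipe underruns across the attached logs, something like the following could be used. It is only a sketch: the regular expression is written against the message format shown above, and the helper name tally_underruns is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [155348.188820] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
UNDERRUN_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\].*\*ERROR\* CPU pipe ([A-Z]) FIFO underrun"
)


def tally_underruns(lines):
    """Count underrun messages per pipe and collect their timestamps."""
    counts = Counter()
    stamps = []
    for line in lines:
        match = UNDERRUN_RE.search(line)
        if match:
            counts[match.group(2)] += 1
            stamps.append(float(match.group(1)))
    return counts, stamps


sample = [
    "[155348.188820] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun",
    "[169094.717384] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun",
    "[331635.239381] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun",
]
counts, stamps = tally_underruns(sample)
print(dict(counts))
```

For a real log, pass the lines of the saved dmesg file instead of the sample list.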

Comment 82 Johannes Berg 2019-09-04 06:42:34 UTC

Created attachment 145261 [details]
requested dmesg from new drm-tip commit 244c5c8116c0042d61455697a9d85e899e2d9267 (with drm.debug=0x1e)

Comment 83 Lakshmi 2019-09-06 11:30:32 UTC

(In reply to Johannes Berg from comment #82)
> Created attachment 145261 [details]
> requested dmesg from new drm-tip commit
> 244c5c8116c0042d61455697a9d85e899e2d9267 (with drm.debug=0x1e)

Thanks for gathering the logs from drmtip. There are some underruns here

[    7.734338] [drm:gen8_de_irq_handler [i915]] hotplug event received, stat 0x00200000, dig 0x10101011, pins 0x00000020, long 0x00000000
[    7.734378] [drm:intel_hpd_irq_handler [i915]] digital hpd port B - short
[    7.734437] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port B - short
[    7.736611] [drm:intel_dp_hpd_pulse [i915]] got esi 01 10 00
[    7.744269] [drm:intel_dp_hpd_pulse [i915]] got esi2 01 00 00
[    7.744307] [drm:intel_dp_hpd_pulse [i915]] got esi 01 00 00
[    7.750579] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[    7.750676] [drm:intel_fbc_underrun_work_fn [i915]] Disabling FBC due to FIFO underrun.

@Ville, help here?

Comment 84 russianneuromancer 2019-09-08 17:11:14 UTC

Just a reminder: Bug 111201 involves exactly the same laptop, with exactly the same error in the log:

[  152.025695] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[  152.025858] [drm:intel_fbc_underrun_work_fn [i915]] Disabling FBC due to FIFO underrun.

Comment 85 Shmerl 2019-09-11 21:18:57 UTC

Created attachment 145335 [details]
Debug log with underrun errors

Attaching log from /var/log/debug with underrun messages. Produced after booting with drm.debug=0x1e

Comment 86 Lakshmi 2019-09-12 07:52:16 UTC

(In reply to Karthik B S from comment #61)
> (In reply to Johannes Berg from comment #60)
> > (In reply to Karthik B S from comment #58)
> > > (In reply to Lakshmi from comment #57)
> > > > @Karthik, any updates on this issue?
> > > 
> > > Unfortunately no. The last update is that I'm not able to reproduce it.
> > > After all the futile attempts at reproducing the issue, it looks like this
> > > is a machine specific issue.
> > 
> > Let's take this internally then (look me up in the Intel address book),
> > maybe we have more such machines and can provide one, or maybe I can swap
> > machines and send you this one, or something like that.
> > 
> > Or maybe I can put the system on the VPN and let you look at it that way.
> 
> Sure.

@Karthik, any further updates here? There are more updated logs attached under this bug.

Comment 87 Karthik B S 2019-09-17 02:26:53 UTC

(In reply to Lakshmi from comment #86)
> (In reply to Karthik B S from comment #61)
> > (In reply to Johannes Berg from comment #60)
> > > (In reply to Karthik B S from comment #58)
> > > > (In reply to Lakshmi from comment #57)
> > > > > @Karthik, any updates on this issue?
> > > > 
> > > > Unfortunately no. The last update is that I'm not able to reproduce it.
> > > > After all the futile attempts at reproducing the issue, it looks like this
> > > > is a machine specific issue.
> > > 
> > > Let's take this internally then (look me up in the Intel address book),
> > > maybe we have more such machines and can provide one, or maybe I can swap
> > > machines and send you this one, or something like that.
> > > 
> > > Or maybe I can put the system on the VPN and let you look at it that way.
> > 
> > Sure.
> 
> @Karthik, any further updates here? There are more updated logs attached
> under this bug.

I tried to debug it over VPN on the system provided by Johannes, but couldn't make much progress. I'll look into the new logs and provide an update.

Comment 88 Shmerl 2019-09-17 18:42:02 UTC

Just updated to the latest UEFI firmware for my Dell Latitude 7490. The problem is still present. I just got the following errors in dmesg (this time there is an additional detail about the DisplayPort payload):

[  877.679147] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
[  885.611061] [drm:intel_mst_enable_dp [i915]] *ERROR* Timed out waiting for ACT sent
[  901.774749] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to get link status
[  907.540929] [drm:intel_mst_disable_dp [i915]] *ERROR* failed to update payload -22
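
The -22 in the last line is a negated errno value, which is how kernel functions usually report failure. A small sketch for looking up such codes (the helper name describe_kernel_error is hypothetical; the message text comes from the local C library and may differ between platforms):

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return value to its errno name and message."""
    num = abs(code)
    return errno.errorcode.get(num, "UNKNOWN"), os.strerror(num)


print(describe_kernel_error(-22))
```

Here -22 maps to EINVAL, meaning the payload update was rejected as an invalid argument rather than timing out.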

Comment 89 Shmerl 2019-09-17 18:43:07 UTC

Could be because I was plugging in and unplugging USB-C cable that routes DP signal to two external monitors. Sometimes it helps to work around the problem.

Comment 90 Johannes Berg 2019-10-08 19:17:15 UTC

Ping, anything we should do here?

If you happen to have any colleagues in Israel who might be able to take a look - I'll be going there in a month or so, ping me internally.

Comment 91 russianneuromancer 2019-10-08 19:26:05 UTC

Interesting idea. I will visit Germany in a couple of months. Can anyone from Intel take a look at the issue described in Bug 111201 in Germany in December?

Comment 92 Johannes Berg 2019-10-08 20:17:35 UTC

Well, I live in Germany and work for Intel, but so far that hasn't helped me ;-)

Comment 93 Jani Saarinen 2019-11-26 15:08:20 UTC

There have been changes to DP-MST lately, so would you be able to test with the latest drm-tip and, if the problem still exists, attach the dmesg from that kernel booted with drm.debug=0x1e log_buf_len=4M?

Comment 94 Martin Peres 2019-11-29 17:58:08 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/175.
