Bug 111500

Summary: [CFL] Latitude 7400 2-in-1: intermittent screen freezes (i915)
Product: DRI
Reporter: Leho Kraav (:macmaN :lkraav) <leho>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: howaboutsynergy, intel-gfx-bugs
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: Triaged
i915 platform: CFL
i915 features:
Attachments:
Description                               Flags
dmesg.txt                                 none
dmesg.txt post-freeze                     none
i915_display_info.txt post-freeze         none
package_cstate_show.txt post-freeze       none
i915_dmc_info.txt post-freeze             none
i915_edp_psr_status.txt post-freeze       none
dmesg-systemctl-suspend-i915.txt          none
i915_display_info_2019-09-04.txt          none
i915_edp_psr_status.txt                   none
ps-aux.txt                                none
journalctl-b-early-boot-gdm.txt           none
journalctl-b-drm-tip.txt                  none

Description Leho Kraav (:macmaN :lkraav) 2019-08-27 12:16:47 UTC
(Originally at https://bugzilla.kernel.org/show_bug.cgi?id=204701)

Hardware: Dell Latitude 7400 2-in-1 with Wacom 0x48C9 touchscreen

Seemingly at random, the graphics engine of this Coffee Lake gen9 chip intermittently freezes on me. On at least a few occasions, the mouse cursor was in a Firefox web browser window at the time.

External screen, connected via Type C, also froze.

Kernel stays alive, because I could still reboot via magic sysrq and REISUB.

After rebooting, checking `journalctl -b -1` reveals zero error messages from Xorg, or kernel.

It feels like i915 might be involved with maybe one of the performance parameters (dc, psr, fbc, rc6, etc) being incompatible with this relatively new hardware.

There's also another known bug with this (touch)screen at designware i2c level https://bugzilla.kernel.org/show_bug.cgi?id=204063

Of course, this might just be faulty hardware, but setting up testing a Win10 installation here is a large challenge. I'd like to exclude Linux software stack issues first.

====

Q: what kernel or i915 parameters would you recommend I test first, to see if freezes can go away?

====

PS I've run this same OS installation on DELL 7480 (KBL) for 2 years without any video issues, so it's either some new gen hardware incompatibility or broken hardware.
Comment 1 Lakshmi 2019-08-28 07:09:58 UTC
Since dmesg is not attached it's hard to say anything.
Can you please verify the issue on drm-tip (https://cgit.freedesktop.org/drm-tip) with kernel parameters drm.debug=0x1e log_buf_len=4M? If the problem persists, attach the full dmesg from boot.
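
For reference, a minimal sketch of what the requested `drm.debug=0x1e` mask selects. The bit-to-category table is an assumption taken from the description of the `drm.debug` module parameter in the kernel's DRM print header and may differ between kernel versions; `decode_drm_debug` is a hypothetical helper name, not part of any tool.

```python
# Assumed mapping of drm.debug bits to message categories
# (from the drm.debug module parameter description; verify against your kernel).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories whose bit is set in `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1E))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under that mapping, 0x1e leaves out CORE and VBL but includes ATOMIC, which logs atomic state changes such as cursor-plane updates and is presumably why logs taken with this mask grow quickly.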
Comment 2 Leho Kraav (:macmaN :lkraav) 2019-08-28 08:36:13 UTC
Created attachment 145184 [details]
dmesg.txt

Indeed, forgot dmesg, attached now.

PS I've currently reached 35h uptime at regular daily workload yesterday - one big change: via elimination process I also turned off Thunderbolt support at BIOS level.

Would not be surprised if Thunderbolt vector is somehow involved. I will post further updates as I get more uptime and behavior data through this week.
Comment 3 Leho Kraav (:macmaN :lkraav) 2019-08-28 09:06:30 UTC
https://lkml.org/lkml/2019/6/29/186 this thread looks incredibly similar to my symptoms.
Comment 4 Leho Kraav (:macmaN :lkraav) 2019-08-28 09:10:27 UTC
IF I manage to reproduce this (with or without Thunderbolt), I will also test `enable_psr=0` per Chris Wilson's comment https://lore.kernel.org/lkml/156283735757.12757.8954391372130933707@skylake-alporthouse-com/
Comment 5 Leho Kraav (:macmaN :lkraav) 2019-08-29 18:07:05 UTC
Today I enabled Touchscreen in BIOS and had the screen freeze a few hours later.

Suspect vector might be pointing away from Thunderbolt -> Touchscreen.
Comment 6 Lakshmi 2019-08-30 08:36:44 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #2)
> Created attachment 145184 [details]
> dmesg.txt
> 
> Indeed, forgot dmesg, attached now.
> 
Can you please attach the full log from boot. Looks like logs are filtered.

[   60.375045] rfkill: input handler disabled
[  730.707123] sda: detected capacity change from 4120903680 to 0
[  836.158591] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[  836.158595] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[  836.158634] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[  836.158635] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[  836.158637] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[  836.159596] mce: CPU2: Core temperature/speed normal
[  836.159597] mce: CPU1: Package temperature/speed normal
[  836.159598] mce: CPU0: Package temperature/speed normal
[  836.159599] mce: CPU3: Package temperature/speed normal
[  836.159600] mce: CPU2: Package temperature/speed normal
[  981.590787] usb 1-3: USB disconnect, device number 3
[ 2529.043657] usb 1-4: USB disconnect, device number 5
Comment 7 Leho Kraav (:macmaN :lkraav) 2019-08-30 08:41:12 UTC
dmesg.txt is not filtered, between those timestamps literally nothing got logged at kernel level.

Full `journalctl -b` output might be more descriptive, but there's a bunch of Gnome noise in there..
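
One way to show how quiet the kernel log really was is to list the gaps between consecutive timestamps. The sketch below does that for a few lines copied from the excerpt quoted in comment 6; `quiet_periods` and its `threshold` parameter are hypothetical names chosen for illustration. It cannot prove that a log is unfiltered, only where the silences are.

```python
import re

# A few lines copied from the dmesg excerpt quoted in comment 6.
DMESG = """\
[   60.375045] rfkill: input handler disabled
[  730.707123] sda: detected capacity change from 4120903680 to 0
[  836.158591] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[  981.590787] usb 1-3: USB disconnect, device number 3
[ 2529.043657] usb 1-4: USB disconnect, device number 5
"""

# dmesg prefixes each line with "[seconds.microseconds]" since boot.
STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def quiet_periods(log, threshold=60.0):
    """Return (start, end) pairs of consecutive timestamps more than `threshold` seconds apart."""
    matches = (STAMP_RE.match(line) for line in log.splitlines())
    stamps = [float(m.group(1)) for m in matches if m]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


for start, end in quiet_periods(DMESG, threshold=600.0):
    print(f"nothing logged for {end - start:.0f} s after t={start:.0f} s")
```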
Comment 8 Leho Kraav (:macmaN :lkraav) 2019-09-03 05:39:01 UTC
Created attachment 145237 [details]
dmesg.txt post-freeze

Another hang last night. It's not the touchscreen, nor Thunderbolt, since both were disabled in BIOS.

I was able to SSH into the machine and poke around some i915 debug files.

Nothing is displayed in dmesg at default log level.

Today:

* updating to 5.3.0-rc7
* booting `drm.debug=0x1e log_buf_len=4M`
* getting ready to also build `drm-tip`
Comment 9 Leho Kraav (:macmaN :lkraav) 2019-09-03 05:40:52 UTC
Created attachment 145238 [details]
i915_display_info.txt post-freeze

`cat i915_display_info` took a long while to display. It looked like `cat` froze, until I tested suspending the machine.

Overall, nothing I tried was able to unfreeze the screen. No screen operations would work, `xrandr`, `xset dpms` and the like all hung.
Comment 10 Leho Kraav (:macmaN :lkraav) 2019-09-03 05:41:29 UTC
Created attachment 145239 [details]
package_cstate_show.txt post-freeze
Comment 11 Leho Kraav (:macmaN :lkraav) 2019-09-03 05:42:22 UTC
Created attachment 145240 [details]
i915_dmc_info.txt post-freeze

This freeze has occurred both with and without booting `guc_enable`, it doesn't look like it's involved.
Comment 12 Leho Kraav (:macmaN :lkraav) 2019-09-03 05:42:47 UTC
Created attachment 145241 [details]
i915_edp_psr_status.txt post-freeze
Comment 13 Lakshmi 2019-09-03 07:13:34 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #8)
> Created attachment 145237 [details]
> dmesg.txt post-freeze
> 
> Another hang last night. It's not the touchscreen, nor Thunderbolt, since
> both were disabled in BIOS.
> 
> I was able to SSH into the machine and poke around some i915 debug files.
> 
> Nothing is displayed in dmesg at default log level.
> 
> Today:
> 
> * updating to 5.3.0-rc7
> * booting `drm.debug=0x1e log_buf_len=4M`
> * getting ready to also build `drm-tip`

The attached log wasn't captured with kernel parameters drm.debug=0x1e log_buf_len=4M. Can you attach one captured with these parameters?
Comment 14 Leho Kraav (:macmaN :lkraav) 2019-09-03 07:49:17 UTC
I'm running a 5.3.0-rc7 session with drm.debug=0x1e now, will post the log as soon as I can reproduce a freeze.

During last night's post-freeze SSH session, I noticed Chrome process was stuck in D state.

I had Slack window in the foreground at the time, but still maybe Chrome triggers this with some graphics operation in the background.

Or Chrome got randomly stuck with a lower level problem.
Comment 15 Leho Kraav (:macmaN :lkraav) 2019-09-04 12:25:24 UTC
Created attachment 145262 [details]
dmesg-systemctl-suspend-i915.txt

Freeze again, few seconds after clicking a Youtube video into fullscreen mode.

At that moment, laptop was lid closed (PSR disabled?), connected to 5120x1440 monitor over USB Type-C.

dmesg w/ drm.debug=0x1e is attached. Nothing specific or weird (to my untrained eye) seems to happen at the tail end of the log, when we freeze.

My suspend attempt also gets blocked:

[102549.527983] PM: suspend entry (s2idle)
[102549.606190] Filesystems sync: 0.078 seconds
[102549.615493] Freezing user space processes ... 
[102569.620234] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
[102569.620429] spotify         D    0  7139   7138 0x00000004
[102569.620435] Call Trace:
[102569.620448]  ? __schedule+0x1ca/0x4f0
[102569.620452]  schedule+0x31/0xb0
[102569.620456]  schedule_preempt_disabled+0xc/0x20
[102569.620460]  __mutex_lock.isra.1+0x1ff/0x4f0
[102569.620518]  __i915_gem_free_objects+0x7d/0x230 [i915]
[102569.620569]  ? i915_gem_dumb_create+0x90/0x90 [i915]
[102569.620657]  i915_gem_create_ioctl+0x12/0x30 [i915]
[102569.620676]  drm_ioctl_kernel+0xad/0xf0 [drm]
[102569.620692]  drm_ioctl+0x2e6/0x3a0 [drm]
[102569.620741]  ? i915_gem_dumb_create+0x90/0x90 [i915]
[102569.620748]  ? pipe_read+0x2a0/0x2d0
[102569.620753]  do_vfs_ioctl+0xa0/0x620
[102569.620759]  ksys_ioctl+0x35/0x70
[102569.620763]  __x64_sys_ioctl+0x11/0x20
[102569.620768]  do_syscall_64+0x43/0x110
[102569.620774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[102569.620779] RIP: 0033:0x7f4c07285057
[102569.620789] Code: Bad RIP value.
[102569.620791] RSP: 002b:00007ffeaf8ecae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[102569.620795] RAX: ffffffffffffffda RBX: 0000000005a2dbf8 RCX: 00007f4c07285057
[102569.620797] RDX: 00007ffeaf8ecb70 RSI: 00000000c010645b RDI: 0000000000000014
[102569.620800] RBP: 00007ffeaf8ecb70 R08: 0000000005d68860 R09: 0000000000000007
[102569.620802] R10: 00007f4c07352c40 R11: 0000000000000246 R12: 00000000c010645b
[102569.620804] R13: 0000000000000014 R14: 00007ffeaf8ecb70 R15: 0000000005a2db88
[102569.620866] OOM killer enabled.
[102569.620868] Restarting tasks ... 
[102569.638470] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:116] for [PLANE:45:cursor A] state 000000006d7a0aa2
[102569.638499] [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:48:pipe A] has [PLANE:45:cursor A] with fb 116
[102569.638515] [drm:intel_plane_atomic_calc_changes [i915]] [PLANE:45:cursor A] visible 1 -> 1, off 0, on 0, ms 0
[102569.639014] done.
[102569.677899] PM: suspend exit
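
A minimal sketch of how a freezer failure like the one above can be reduced to "which task is stuck, and in which module". It parses a few lines copied from the excerpt; the regular expressions assume the task-dump layout shown here (comm, state, free stack, PID, parent PID, flags), and `blocked_tasks` is a hypothetical helper. Frames prefixed with `?` are skipped because the kernel marks them as unreliable.

```python
import re

# Lines copied from the suspend attempt above.
LOG = """\
[102569.620234] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
[102569.620429] spotify         D    0  7139   7138 0x00000004
[102569.620435] Call Trace:
[102569.620448]  ? __schedule+0x1ca/0x4f0
[102569.620460]  __mutex_lock.isra.1+0x1ff/0x4f0
[102569.620518]  __i915_gem_free_objects+0x7d/0x230 [i915]
[102569.620657]  i915_gem_create_ioctl+0x12/0x30 [i915]
[102569.620676]  drm_ioctl_kernel+0xad/0xf0 [drm]
"""

# Task header: comm, state, free stack, pid, ppid, flags (layout assumed from the excerpt).
TASK_RE = re.compile(r"^\[\s*[\d.]+\]\s+(\S+)\s+([A-Z])\s+\d+\s+(\d+)\s+\d+\s+0x[0-9a-f]+$")
# Stack frame: optional "? " marker, symbol+offset/size, optional [module].
FRAME_RE = re.compile(r"^\[\s*[\d.]+\]\s+(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?$")


def blocked_tasks(log):
    """Return one dict per dumped task with its reliable (symbol, module) frames."""
    tasks = []
    current = None
    for line in log.splitlines():
        task = TASK_RE.match(line)
        if task:
            current = {"comm": task.group(1), "state": task.group(2),
                       "pid": int(task.group(3)), "frames": []}
            tasks.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and not frame.group(1):
            current["frames"].append((frame.group(2), frame.group(3)))
    return tasks


for t in blocked_tasks(LOG):
    first_i915 = next((sym for sym, mod in t["frames"] if mod == "i915"), None)
    print(t["comm"], t["pid"], t["state"], "waiting under", first_i915)
```

For the excerpt this reports the `spotify` task (PID 7139, state D) blocked in a mutex lock called from `__i915_gem_free_objects`.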
Comment 16 Leho Kraav (:macmaN :lkraav) 2019-09-04 12:27:04 UTC
Created attachment 145263 [details]
i915_display_info_2019-09-04.txt
Comment 17 Leho Kraav (:macmaN :lkraav) 2019-09-04 12:28:34 UTC
Created attachment 145264 [details]
i915_edp_psr_status.txt

PSR mode: disabled, because Lid is closed, right?
Comment 18 Leho Kraav (:macmaN :lkraav) 2019-09-04 12:34:48 UTC
Created attachment 145265 [details]
ps-aux.txt

Process list shows Spotify PID 7139 with argument `--type=gpu-process` is in state D (uninterruptible sleep).

Spotify app was sitting in some background workspace.

Something is going wrong with apps using hardware acceleration?

Any new thoughts based on these logs? Anything else I can log, or perhaps I should backtrack to a latest stable kernel like 4.19? It'd suck to lose 5.3.0 s2idle improvements from https://bugzilla.kernel.org/show_bug.cgi?id=199689 but need to get this freeze issue fixed as well.
Comment 19 Leho Kraav (:macmaN :lkraav) 2019-09-04 14:14:15 UTC
Another freeze shortly into the next session, after vertically expanding a Thunderbird mail window. `khugepaged` and `kworker+i915` were stuck in D state now.

I'm preparing to roll back towards 4.19.69 LTS and see how that version behaves.

Or does drm-tip have a greater chance of revealing something?

I wonder when it is time to suspect hardware issues?
Comment 20 Lakshmi 2019-09-06 11:14:58 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #19)

> Or does drm-tip have a greater chance of revealing something?
We recommend verifying the issue with drm-tip https://cgit.freedesktop.org/drm-tip. 
Also please attach full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M. 

That might give more details about the issue. In the attached logs either the logs are not from boot or the logs are not taken with kernel parameters drm.debug=0x1e log_buf_len=4M.
Comment 21 Leho Kraav (:macmaN :lkraav) 2019-09-06 11:50:03 UTC
Created attachment 145277 [details]
journalctl-b-early-boot-gdm.txt

> In the attached logs either the logs are not from boot or the logs are not taken with kernel parameters drm.debug=0x1e log_buf_len=4M.

Oh, you specifically need the boot early phase output of `drm.debug=0x1e`? I also saved the full `journalctl -b` (137M file), cutting and attaching early phase log now. This includes loading gdm, but stops right before loading my user Xorg session.

Let me know if you also need any output from my user Xorg session, maybe the initial post-load part, or something.

(Because 0x1e produces output on every mouse move, I was confused about how it is possible to get much info within the log_buf_len=4M window, because the majority of it would be mouse moves.)

Based on seeing `khugepaged` hung during previous freeze, I'm currently testing 5.3.0-rc7 with `transparent_hugepage=never` and am currently approaching 48 h uptime. If this still freezes at some point, I will test drm-tip next.
Comment 22 Leho Kraav (:macmaN :lkraav) 2019-09-08 14:54:46 UTC
5.3.0-rc7 with `transparent_hugepage=never` froze at ~90h uptime.

Stuck processes:

root        40  0.0  0.0      0     0 ?        D    sept 04   0:01 [kcompactd0]
root     15609  0.0  0.0      0     0 ?        D    sept 07   0:00 [kworker/u8:24+i915]
leho     22473  0.0  0.4 424868 66448 tty2     Dl+  sept 05   0:41 /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10498227519519038222,5127582014693209854,131072 --enable-crash-reporter=36765117-52CC-FCEE-D12A-477CC394751F, --gpu-preferences=IAAAAAAAAAAgAAAgAAAAAAAAYAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --service-request-channel-token=10254689475338789367

Built and booted today's drm-tip now, with drm.debug=0x1e, let's see how this goes.
Comment 23 Leho Kraav (:macmaN :lkraav) 2019-09-08 15:14:53 UTC
Created attachment 145300 [details]
journalctl-b-drm-tip.txt

drm-tip early boot log.
Comment 24 Lakshmi 2019-09-09 07:35:51 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #23)
> Created attachment 145300 [details]
> journalctl-b-drm-tip.txt
> 
> drm-tip early boot log.

Did you see any freeze here?
Comment 25 Leho Kraav (:macmaN :lkraav) 2019-09-09 08:03:53 UTC
(In reply to Lakshmi from comment #24)
> (In reply to Leho Kraav (:macmaN :lkraav) from comment #23)
> > Created attachment 145300 [details]
> > journalctl-b-drm-tip.txt
> > 
> > drm-tip early boot log.
> 
> Did you see any freeze here?

This drm-tip session is currently at 17h uptime, it hasn't frozen yet. If the problem still exists, it might take several days to manifest, as it has done so before (seemingly quite randomly, on surface..).
Comment 26 Leho Kraav (:macmaN :lkraav) 2019-09-09 12:07:55 UTC
I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip: 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris Wilson>) by just moving the mouse around in a Firefox window.

Since I rebooted after my last message here, this freeze only took 2h 45min into the session.

Similar process freeze as earlier:

$ [-] cat ps-aux-grep-D.txt 
root        40  0.0  0.0      0     0 ?        D    11:46   0:01 [kcompactd0]
root     18220  0.0  0.0      0     0 ?        D    13:45   0:00 [kworker/u8:9+i915]
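
A minimal sketch of how a list like the one above can be pulled out of saved `ps aux` output by checking the STAT column for `D` (uninterruptible sleep). The two stuck entries are copied from this comment; the header line and the third process are made up for illustration, and `uninterruptible` is a hypothetical helper name.

```python
PS_AUX = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        40  0.0  0.0      0     0 ?        D    11:46   0:01 [kcompactd0]
root     18220  0.0  0.0      0     0 ?        D    13:45   0:00 [kworker/u8:9+i915]
leho      2001  0.1  0.2 123456  7890 tty2     Sl+  11:47   0:05 /usr/bin/example
"""  # the last row is a hypothetical non-stuck process


def uninterruptible(ps_output):
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        # Split into the 11 `ps aux` columns; the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck


print(uninterruptible(PS_AUX))
```

One caveat: when START is printed as a date containing a space (as in the "sept 04" entries earlier in this thread), the columns after STAT shift by one, so the command field picks up the TIME value. The STAT check itself is unaffected.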

I'm going to boot 4.19.71 next.

Lakshmi, any next step ideas from your end? Should I enable CONFIG_EXPERT and CONFIG_I915_DEBUG family perhaps?
Comment 27 Lakshmi 2019-09-09 13:07:44 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #26)
> I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip:
> 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris
> Wilson>) by just moving the mouse around in a Firefox window.
> 
> Since I rebooted after my last message here, this freeze only took 2h 45min
> into the session.
> 
> Similar process freeze as earlier:
> 
> $ [-] cat ps-aux-grep-D.txt 
> root        40  0.0  0.0      0     0 ?        D    11:46   0:01 [kcompactd0]
> root     18220  0.0  0.0      0     0 ?        D    13:45   0:00
> [kworker/u8:9+i915]
> 
> I'm going to boot 4.19.71 next.
> 
> Lakshmi, any next step ideas from your end? Should I enable CONFIG_EXPERT
> and CONFIG_I915_DEBUG family perhaps?
Thanks for the feedback from drm-tip. Can you please attach the dmesg log from boot with debug parameters? 
What is the impact of this issue as a user? How do you recover this situation?
Comment 28 Leho Kraav (:macmaN :lkraav) 2019-09-09 13:28:19 UTC
(In reply to Lakshmi from comment #27)
> (In reply to Leho Kraav (:macmaN :lkraav) from comment #26)
> > I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip:
> > 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris
> > Wilson>) by just moving the mouse around in a Firefox window.
> > 
> Thanks for the feedback from drm-tip. Can you please attach the dmesg log
> from boot with debug parameters? 

Argh, that last short session I forgot to add drm.debug parameter.

But my latest drm-tip boot log attachment from yesterday https://bugs.freedesktop.org/attachment.cgi?id=145300&action=edit should be exactly the same, at least for the early phase?

Alternatively, if my current 4.19.71 session should freeze, I will re-do a drm-tip session with drm.debug, and upon freeze, will be able to capture full `journalctl -b` output for the whole session (it will be several hundred MB uncompressed).

> What is the impact of this issue as a user? How do you recover this
> situation?

It's a complete deal-breaker, there is no recovery other than forced reboot via power button (or Magic SysRq > REISUB). There seems to be no way to unfreeze the screen. I haven't experienced this type of difficult-to-solve graphics stack issue on DELL laptop hardware for years.

I generally like this Latitude 7400 2-in-1 physical hardware quality, it's the only reason I'm going through so much trouble trying to identify whether it's a Linux software issue.

I ran all pre-BIOS hardware tests yesterday, also in Thorough mode, everything passed w/ green checkmark - Memory, Processor, Video, etc.

Testing Win10 on this machine is prohibitively difficult workflow-wise, even though it'd be the best way to eliminate "faulty hardware" vector.

I also have another contact in the wild, who has been running this Latitude 7400 2-in-1 hardware on Linux 4.19 branch, and while there have been other problems, he confirmed not having experienced such screen freezes.

Ideally I'd test my setup with another identical laptop, but I have my doubts if the DELL people are going to supply me with one without purchase.
Comment 29 Leho Kraav (:macmaN :lkraav) 2019-09-10 14:08:46 UTC
Look what I just found "5.3-rc3: Frozen graphics with kcompactd migrating i915 pages" https://lkml.org/lkml/2019/8/9/433
Comment 30 Dq8CokMHloQZw 2019-09-10 14:51:59 UTC
So it's the same as https://bugs.freedesktop.org/show_bug.cgi?id=111601#c11
according to the stacktraces during the frozen time via ssh.
Comment 31 Dq8CokMHloQZw 2019-09-10 14:56:48 UTC
Actually I take that back, was based only on the `__i915_gem_free_objects` stacktrace and it's not even as similar as I initially thought, hmm...

Maybe devs would know.
Comment 32 Leho Kraav (:macmaN :lkraav) 2019-09-10 15:00:46 UTC
I finally found a match for your LKML thread on search engines with the keywords "freeze" next to "kcompactd". Previous searches for "locked, hung" etc utterly failed.

I'm about to test reverting aa56a292ce623734ddd30f52d73f527d1f3529b5.
Comment 33 Dq8CokMHloQZw 2019-09-10 17:01:34 UTC
According to Comment 29 lkml thread it is the same issue and so I guess reverting that commit would work for you too then.

So disregard my Comment 31 :)
Comment 34 Leho Kraav (:macmaN :lkraav) 2019-09-19 18:07:49 UTC
Hi Francesco. This bug was fixed by a revert commit for 5.3.0 release https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=505a8ec7e11ae5236c4a154a1e24ef49a8349600
Comment 35 Francesco Balestrieri 2019-09-24 05:24:50 UTC
Thanks for confirming!