(Originally at https://bugzilla.kernel.org/show_bug.cgi?id=204701)
Hardware: Dell Latitude 7400 2-in-1 with Wacom 0x48C9 touchscreen
Seemingly at random, this Coffeelake gen9 chip graphics engine is intermittently freezing on me. The mouse cursor was in a Firefox web browser window on at least a few of these occurrences.
External screen, connected via Type C, also froze.
Kernel stays alive, because I could still reboot via magic sysrq and REISUB.
After rebooting, checking `journalctl -b -1` reveals zero error messages from Xorg, or kernel.
It feels like i915 might be involved with maybe one of the performance parameters (dc, psr, fbc, rc6, etc) being incompatible with this relatively new hardware.
There's also another known bug with this (touch)screen at designware i2c level https://bugzilla.kernel.org/show_bug.cgi?id=204063
Of course, this might just be faulty hardware, but setting up testing a Win10 installation here is a large challenge. I'd like to exclude Linux software stack issues first.
Q: what kernel or i915 parameters would you recommend I test first, to see whether the freezes go away?
PS I've run this same OS installation on DELL 7480 (KBL) for 2 years without any video issues, so it's either some new gen hardware incompatibility or broken hardware.
Since dmesg is not attached, it's hard to say anything.
Can you please verify the issue on drm-tip (https://cgit.freedesktop.org/drm-tip) with kernel parameters drm.debug=0x1e log_buf_len=4M? If the problem persists, attach the full dmesg from boot.
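For readers wondering what `drm.debug=0x1e` actually turns on: the value is a bitmask of DRM debug categories. The following is a minimal sketch of decoding it, assuming the bit assignments from the kernel's `include/drm/drm_print.h` (the exact set of categories can differ between kernel versions); the function name `decode_drm_debug` is made up for illustration.

```python
# Bit assignments as understood from include/drm/drm_print.h.
# Assumption: newer or older kernels may define a different set.
DRM_DEBUG_CATEGORIES = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_CATEGORIES.items()) if mask & bit]


print(decode_drm_debug(0x1E))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

So 0x1e requests driver, modesetting, PRIME and atomic messages while leaving out the DRM core messages (0x01).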
Created attachment 145184 [details]
Indeed, forgot dmesg, attached now.
PS I've currently reached 35h uptime at regular daily workload yesterday - one big change: via elimination process I also turned off Thunderbolt support at BIOS level.
Would not be surprised if Thunderbolt vector is somehow involved. I will post further updates as I get more uptime and behavior data through this week.
https://lkml.org/lkml/2019/6/29/186 this thread looks incredibly similar to my symptoms.
IF I manage to reproduce this (with or without Thunderbolt), I will also test `enable_psr=0` per Chris Wilson's comment https://lore.kernel.org/lkml/156283735757.12757.8954391372130933707@skylake-alporthouse-com/
Today I enabled Touchscreen in BIOS and had the screen freeze a few hours later.
Suspect vector might be pointing away from Thunderbolt and towards the touchscreen.
(In reply to Leho Kraav (:macmaN :lkraav) from comment #2)
> Created attachment 145184 [details]
> Indeed, forgot dmesg, attached now.
Can you please attach the full log from boot. Looks like logs are filtered.
[ 60.375045] rfkill: input handler disabled
[ 730.707123] sda: detected capacity change from 4120903680 to 0
[ 836.158591] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 836.158595] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 836.158634] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 836.158635] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 836.158637] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 836.159596] mce: CPU2: Core temperature/speed normal
[ 836.159597] mce: CPU1: Package temperature/speed normal
[ 836.159598] mce: CPU0: Package temperature/speed normal
[ 836.159599] mce: CPU3: Package temperature/speed normal
[ 836.159600] mce: CPU2: Package temperature/speed normal
[ 981.590787] usb 1-3: USB disconnect, device number 3
[ 2529.043657] usb 1-4: USB disconnect, device number 5
dmesg.txt is not filtered, between those timestamps literally nothing got logged at kernel level.
Full `journalctl -b` output might be more descriptive, but there's a bunch of Gnome noise in there..
Created attachment 145237 [details]
Another hang last night. It's not the touchscreen, nor Thunderbolt, since both were disabled in BIOS.
I was able to SSH into the machine and poke around some. i915 debug files.
Nothing is displayed in dmesg at default log level.
* updating to 5.3.0-rc7
* booting `drm.debug=0x1e log_buf_len=4M`
* getting ready to also build `drm-tip`
Created attachment 145238 [details]
`cat i915_display_info` took a long while to display. It looked like `cat` froze, until I tested suspending the machine.
Overall, nothing I tried was able to unfreeze the screen. No screen operations would work, `xrandr`, `xset dpms` and the like all hung.
Created attachment 145239 [details]
Created attachment 145240 [details]
This freeze has occurred both with and without booting `guc_enable`, it doesn't look like it's involved.
Created attachment 145241 [details]
(In reply to Leho Kraav (:macmaN :lkraav) from comment #8)
> Created attachment 145237 [details]
> dmesg.txt post-freeze
> Another hang last night. It's not the touchscreen, nor Thunderbolt, since
> both were disabled in BIOS.
> I was able to SSH into the machine and poke around some. i915 debug files.
> Nothing is displayed in dmesg at default log level.
> * updating to 5.3.0-rc7
> * booting `drm.debug=0x1e log_buf_len=4M`
> * getting ready to also build `drm-tip`
The attached log was not captured with kernel parameters drm.debug=0x1e log_buf_len=4M. Can you attach one captured with these parameters?
I'm running a 5.3.0-rc7 session with drm.debug=0x1e now, will post the log as soon as I can reproduce a freeze.
During last night's post-freeze SSH session, I noticed Chrome process was stuck in D state.
I had Slack window in the foreground at the time, but still maybe Chrome triggers this with some graphics operation in the background.
Or Chrome got randomly stuck with a lower level problem.
Created attachment 145262 [details]
Freeze again, few seconds after clicking a Youtube video into fullscreen mode.
At that moment, laptop was lid closed (PSR disabled?), connected to 5120x1440 monitor over USB Type-C.
dmesg w/ drm.debug=0x1e is attached. Nothing specific or weird (to my untrained eye) seems to happen at the tail end of the log, when we freeze.
My suspend attempt also gets blocked:
[102549.527983] PM: suspend entry (s2idle)
[102549.606190] Filesystems sync: 0.078 seconds
[102549.615493] Freezing user space processes ...
[102569.620234] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
[102569.620429] spotify D 0 7139 7138 0x00000004
[102569.620435] Call Trace:
[102569.620448] ? __schedule+0x1ca/0x4f0
[102569.620518] __i915_gem_free_objects+0x7d/0x230 [i915]
[102569.620569] ? i915_gem_dumb_create+0x90/0x90 [i915]
[102569.620657] i915_gem_create_ioctl+0x12/0x30 [i915]
[102569.620676] drm_ioctl_kernel+0xad/0xf0 [drm]
[102569.620692] drm_ioctl+0x2e6/0x3a0 [drm]
[102569.620741] ? i915_gem_dumb_create+0x90/0x90 [i915]
[102569.620748] ? pipe_read+0x2a0/0x2d0
[102569.620779] RIP: 0033:0x7f4c07285057
[102569.620789] Code: Bad RIP value.
[102569.620791] RSP: 002b:00007ffeaf8ecae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[102569.620795] RAX: ffffffffffffffda RBX: 0000000005a2dbf8 RCX: 00007f4c07285057
[102569.620797] RDX: 00007ffeaf8ecb70 RSI: 00000000c010645b RDI: 0000000000000014
[102569.620800] RBP: 00007ffeaf8ecb70 R08: 0000000005d68860 R09: 0000000000000007
[102569.620802] R10: 00007f4c07352c40 R11: 0000000000000246 R12: 00000000c010645b
[102569.620804] R13: 0000000000000014 R14: 00007ffeaf8ecb70 R15: 0000000005a2db88
[102569.620866] OOM killer enabled.
[102569.620868] Restarting tasks ...
[102569.638470] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:116] for [PLANE:45:cursor A] state 000000006d7a0aa2
[102569.638499] [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:48:pipe A] has [PLANE:45:cursor A] with fb 116
[102569.638515] [drm:intel_plane_atomic_calc_changes [i915]] [PLANE:45:cursor A] visible 1 -> 1, off 0, on 0, ms 0
[102569.677899] PM: suspend exit
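The register dump in the trace above can be cross-checked against the call trace. On x86-64, `ORIG_RAX: 0x10` is the `ioctl` syscall and `RSI` carries the ioctl request number, here `0xc010645b`. Below is a rough sketch of splitting that number into its `_IOC` fields, assuming the common x86 layout (8-bit nr, 8-bit type, 14-bit size, 2-bit direction); `decode_ioctl` is a hypothetical helper, not a kernel or library API.

```python
DRM_COMMAND_BASE = 0x40  # driver-specific DRM ioctl numbers start at this offset


def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields (x86 layout assumed)."""
    return {
        "nr": request & 0xFF,                # command number
        "type": chr((request >> 8) & 0xFF),  # ioctl "magic" character
        "size": (request >> 16) & 0x3FFF,    # size of the argument struct in bytes
        "dir": (request >> 30) & 0x3,        # 1 = write, 2 = read, 3 = read and write
    }


fields = decode_ioctl(0xC010645B)
print(fields)  # {'nr': 91, 'type': 'd', 'size': 16, 'dir': 3}
print(hex(fields["nr"] - DRM_COMMAND_BASE))  # 0x1b
```

Type `'d'` is the DRM ioctl magic, and driver command 0x1b with a 16-byte read/write argument is consistent with `i915_gem_create_ioctl` appearing in the call trace, i.e. the blocked task was in the middle of a GEM object create.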
Created attachment 145263 [details]
Created attachment 145264 [details]
PSR mode: disabled, because Lid is closed, right?
Created attachment 145265 [details]
Process list shows Spotify PID 7139 with argument `--type=gpu-process` is in state D (uninterruptible sleep).
Spotify app was sitting in some background workspace.
Something is going wrong with apps using hardware acceleration?
Any new thoughts based on these logs? Anything else I can log, or perhaps I should backtrack to a latest stable kernel like 4.19? It'd suck to lose 5.3.0 s2idle improvements from https://bugzilla.kernel.org/show_bug.cgi?id=199689 but need to get this freeze issue fixed as well.
Another freeze shortly into the next session, after vertically expanding a Thunderbird mail window. `khugepaged` and `kworker+i915` were stuck in D state now.
I'm preparing to roll back towards 4.19.69 LTS and see how that version behaves.
Or does drm-tip have a greater chance of revealing something?
I wonder when it is time to suspect hardware issues?
(In reply to Leho Kraav (:macmaN :lkraav) from comment #19)
> Or does drm-tip have a greater chance of revealing something?
We recommend verifying the issue with drm-tip https://cgit.freedesktop.org/drm-tip.
Also please attach full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.
That might give more details about the issue. In the attached logs either the logs are not from boot or the logs are not taken with kernel parameters drm.debug=0x1e log_buf_len=4M.
Created attachment 145277 [details]
> In the attached logs either the logs are not from boot or the logs are not taken with kernel parameters drm.debug=0x1e log_buf_len=4M.
Oh, you specifically need the boot early phase output of `drm.debug=0x1e`? I also saved the full `journalctl -b` (137M file), cutting and attaching early phase log now. This includes loading gdm, but stops right before loading my user Xorg session.
Let me know if you also need any output from my user Xorg session, maybe the initial post-load part, or something.
(Because 0x1e produces output on every mouse move, I was confused about how it is possible to get much info within the log_buf_len=4M window, since the majority of it would be mouse moves.)
Based on seeing `khugepaged` hung during previous freeze, I'm currently testing 5.3.0-rc7 with `transparent_hugepage=never` and am currently approaching 48 h uptime. If this still freezes at some point, I will test drm-tip next.
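To confirm that `transparent_hugepage=never` actually took effect after boot, the active mode can be read back from sysfs, where the kernel marks the current setting with square brackets. A small sketch, assuming the usual sysfs path `/sys/kernel/mm/transparent_hugepage/enabled`; the helper names are made up for illustration.

```python
import re


def active_thp_mode(text):
    """Return the bracketed (active) mode from the THP 'enabled' file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode from sysfs (path is an assumption about the system)."""
    with open(path) as f:
        return active_thp_mode(f.read())


# Example with the string a kernel booted with transparent_hugepage=never would show:
print(active_thp_mode("always madvise [never]"))  # never
```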
5.3.0-rc7 with `transparent_hugepage=never` froze at ~90h uptime.
root 40 0.0 0.0 0 0 ? D sept 04 0:01 [kcompactd0]
root 15609 0.0 0.0 0 0 ? D sept 07 0:00 [kworker/u8:24+i915]
leho 22473 0.0 0.4 424868 66448 tty2 Dl+ sept 05 0:41 /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10498227519519038222,5127582014693209854,131072 --enable-crash-reporter=36765117-52CC-FCEE-D12A-477CC394751F, --gpu-preferences=IAAAAAAAAAAgAAAgAAAAAAAAYAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --service-request-channel-token=10254689475338789367
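Listings like the one above can be produced by filtering `ps aux` output on the STAT column. The following is a minimal sketch, not the command actually used in this report: it assumes the standard `ps aux` column order (STAT is the 8th field) and deliberately leaves everything after STAT unparsed, because the START column may contain a space (as in `sept 04`). The last sample line is a hypothetical non-blocked process added for contrast.

```python
def d_state_processes(ps_aux_lines):
    """Return (pid, stat, rest) for processes whose STAT begins with 'D'."""
    result = []
    for line in ps_aux_lines:
        # USER PID %CPU %MEM VSZ RSS TTY STAT | START TIME COMMAND (left unparsed)
        fields = line.split(None, 8)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or malformed input
        if fields[7].startswith("D"):
            rest = fields[8] if len(fields) > 8 else ""
            result.append((int(fields[1]), fields[7], rest))
    return result


sample = [
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND",
    "root 40 0.0 0.0 0 0 ? D sept 04 0:01 [kcompactd0]",
    "root 15609 0.0 0.0 0 0 ? D sept 07 0:00 [kworker/u8:24+i915]",
    "leho 1234 0.0 0.1 1000 500 tty2 Sl+ sept 05 0:00 bash",  # hypothetical
]

for pid, stat, rest in d_state_processes(sample):
    print(pid, stat, rest)
```

On a live system the input could come from something like `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout.splitlines()`.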
Built and booted today's drm-tip now, with drm.debug=0x1e, let's see how this goes.
Created attachment 145300 [details]
drm-tip early boot log.
(In reply to Leho Kraav (:macmaN :lkraav) from comment #23)
> Created attachment 145300 [details]
> drm-tip early boot log.
Did you see any freeze here?
(In reply to Lakshmi from comment #24)
> (In reply to Leho Kraav (:macmaN :lkraav) from comment #23)
> > Created attachment 145300 [details]
> > journalctl-b-drm-tip.txt
> > drm-tip early boot log.
> Did you see any freeze here?
This drm-tip session is currently at 17h uptime, it hasn't frozen yet. If the problem still exists, it might take several days to manifest, as it has done so before (seemingly quite randomly, on surface..).
I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip: 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris Wilson>) by just moving the mouse around in a Firefox window.
Since I rebooted after my last message here, this freeze only took 2h 45min into the session.
Similar process freeze as earlier:
$ [-] cat ps-aux-grep-D.txt
root 40 0.0 0.0 0 0 ? D 11:46 0:01 [kcompactd0]
root 18220 0.0 0.0 0 0 ? D 13:45 0:00 [kworker/u8:9+i915]
I'm going to boot 4.19.71 next.
Lakshmi, any next step ideas from your end? Should I enable CONFIG_EXPERT and the CONFIG_I915_DEBUG family perhaps?
(In reply to Leho Kraav (:macmaN :lkraav) from comment #26)
> I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip:
> 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris
> Wilson>) by just moving the mouse around in a Firefox window.
> Since I rebooted after my last message here, this freeze only took 2h 45min
> into the session.
> Similar process freeze as earlier:
> $ [-] cat ps-aux-grep-D.txt
> root 40 0.0 0.0 0 0 ? D 11:46 0:01 [kcompactd0]
> root 18220 0.0 0.0 0 0 ? D 13:45 0:00
> I'm going to boot 4.19.71 next.
> Lakshmi, any next step ideas from your end? Should I enabled CONFIG_EXPERT
> and CONFIG_I915_DEBUG family perhaps?
Thanks for the feedback from drm-tip. Can you please attach the dmesg log from boot with debug parameters?
What is the impact of this issue as a user? How do you recover this situation?
(In reply to Lakshmi from comment #27)
> (In reply to Leho Kraav (:macmaN :lkraav) from comment #26)
> > I just got the same freeze on drm-tip (* d45d78ff950b - drm-tip:
> > 2019y-09m-08d-15h-18m-05s UTC integration manifest (21 hours ago) <Chris
> > Wilson>) by just moving the mouse around in a Firefox window.
> Thanks for the feedback from drmtip. Can you please attach the dmseg log
> from boot with debug parameters?
Argh, that last short session I forgot to add drm.debug parameter.
But my latest drm-tip boot log attachment from yesterday https://bugs.freedesktop.org/attachment.cgi?id=145300&action=edit should be exactly the same, at least for the early phase?
Alternatively, if my current 4.19.71 session should freeze, I will re-do a drm-tip session with drm.debug, and upon freeze, will be able to capture full `journalctl -b` output for the whole session (it will be several hundred MB uncompressed).
> What is the impact of this issue as a user? How do you recover this
It's a complete deal-breaker, there is no recovery other than forced reboot via power button (or Magic SysRq > REISUB). There seems to be no way to unfreeze the screen. I haven't experienced this type of difficult-to-solve graphics stack issue on DELL laptop hardware for years.
I generally like this Latitude 7400 2-in-1 physical hardware quality, it's the only reason I'm going through so much trouble trying to identify whether it's a Linux software issue.
I ran all pre-BIOS hardware tests yesterday, also in Thorough mode, everything passed w/ green checkmark - Memory, Processor, Video, etc.
Testing Win10 on this machine is prohibitively difficult workflow-wise, even though it'd be the best way to eliminate "faulty hardware" vector.
I also have another contact in the wild, who has been running this Latitude 7400 2-in-1 hardware on Linux 4.19 branch, and while there have been other problems, he confirmed not having experienced such screen freezes.
Ideally I'd test my setup with another identical laptop, but I have my doubts if the DELL people are going to supply me with one without purchase.
Look what I just found "5.3-rc3: Frozen graphics with kcompactd migrating i915 pages" https://lkml.org/lkml/2019/8/9/433
So it's the same as https://bugs.freedesktop.org/show_bug.cgi?id=111601#c11
according to the stacktraces during the frozen time via ssh.
Actually I take that back, was based only on the `__i915_gem_free_objects` stacktrace and it's not even as similar as I initially thought, hmm...
Maybe devs would know.
I finally found a match for your LKML thread on search engines with the keywords "freeze" next to "kcompactd". Previous searches for "locked, hung" etc utterly failed.
I'm about to test reverting aa56a292ce623734ddd30f52d73f527d1f3529b5.
According to the LKML thread in Comment 29, it is the same issue, so I guess reverting that commit would work for you too.
So disregard my Comment 31 :)