Summary: | [regression, bisected] kworker consumes 100% of one CPU core after kernel upgrade from 5.2.* to 5.3.* | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Val Kulkov <val.kulkov> | ||||||
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Severity: | critical | ||||||||
Priority: | highest | CC: | imre.deak, intel-gfx-bugs, val.kulkov, wqr.prg | ||||||
Version: | unspecified | Keywords: | bisected, regression | ||||||
Hardware: | x86-64 (AMD64) | ||||||||
OS: | Linux (All) | ||||||||
URL: | https://bbs.archlinux.org/viewtopic.php?pid=1869944 | ||||||||
Whiteboard: | Triaged, ReadyForDev | ||||||||
i915 platform: | BSW/CHT | i915 features: | display/Other | ||||||
Attachments: |
|
Description
Val Kulkov
2019-10-25 02:25:01 UTC
For greater clarity: ad5125d6ef2de22212de82debd0baaecdaa7e7cd is the last good commit, and 6cfe7ec02e854278fb341e62db54d49a2b199c62 is the first bad commit. The regression was present in the mainline kernel as of 2019-10-24. @Imre, Is it worth trying to verify this issue with drmtip? (In reply to Val Kulkov from comment #2) > The regression was present in the mainline kernel as of 2019-10-24. Val, can you attach the dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ? (In reply to Lakshmi from comment #3) > @Imre, Is it worth trying to verify this issue with drmtip? Yes, it would be helpful to try drmtip too. > (In reply to Val Kulkov from comment #2) > > The regression was present in the mainline kernel as of 2019-10-24. > Val, can you attach the dmesg from boot with kernel parameters > drm.debug=0x1e log_buf_len=4M ? Also could you let me know if the issue happens already after boot up, or only after some further steps? (In reply to Lakshmi from comment #3) > @Imre, Is it worth trying to verify this issue with drmtip? > > (In reply to Val Kulkov from comment #2) > > The regression was present in the mainline kernel as of 2019-10-24. > Val, can you attach the dmesg from boot with kernel parameters > drm.debug=0x1e log_buf_len=4M ? Please do so for both the last good and the first bad commits. Imre, any ideas why this would happen? I think we probably need to revert and get back to the drawing board with this one. (In reply to Jani Nikula from comment #5) > (In reply to Lakshmi from comment #3) > > @Imre, Is it worth trying to verify this issue with drmtip? > > > > (In reply to Val Kulkov from comment #2) > > > The regression was present in the mainline kernel as of 2019-10-24. > > Val, can you attach the dmesg from boot with kernel parameters > > drm.debug=0x1e log_buf_len=4M ? > > Please do so for both the last good and the first bad commits. > > Imre, any ideas why this would happen? I think we probably need to revert > and get back to the drawing board with this one. I can tell more if we get the logs. Reverting will be difficult because the ICL TypeC functionality depends on it, so I'd like to first check the root cause. Created attachment 145813 [details] 2 dmesg outputs: last good and first bad (In reply to Lakshmi from comment #3) > Val, can you attach the dmesg from boot with kernel parameters > drm.debug=0x1e log_buf_len=4M ? Kernel parameters "drm.debug=0x1e log_buf_len=4M" have been added. Both the last good and first bad kernels have been tested. I am attaching a zip archive with two dmesg outputs. (In reply to Imre Deak from comment #4) > > Also could you let me know if the issue happens already after boot up, or > only after some further steps? This is a headless NUC-like unit with no X display server. The issue with kworker using 100% of a CPU core appears right away. Not sure if the following will help, but anyway here it is: === first bad kernel === [root@nuc2 ~]# uptime 07:52:40 up 0 min, 1 user, load average: 1.80, 0.50, 0.17 [root@nuc2 ~]# uptime 07:53:28 up 1 min, 1 user, load average: 1.72, 0.67, 0.25 [root@nuc2 ~]# uptime 07:57:55 up 5 min, 1 user, load average: 1.32, 1.10, 0.55 Interestingly, an Intel NUC D34010WYK on 5.3.* kernel does not have this issue: [root@nuc1 ~]# uname -a Linux nuc1 5.3.7-arch1-1-ARCH #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019 x86_64 GNU/Linux [root@nuc1 ~]# uptime 15:47:49 up 26 min, 1 user, load average: 0.00, 0.00, 0.00 [root@nuc1 ~]# lsmod | grep i915 i915 2326528 2 i2c_algo_bit 16384 1 i915 drm_kms_helper 212992 1 i915 drm 516096 3 drm_kms_helper,i915 intel_gtt 24576 1 i915 I'm also seeing the same issue on a headless router. The CPU is Intel N3050 Since both of us observed this issue on headless machines, I tried pluging in a monitor to the machine. After pluging in a HDMI monitor, the CPU load drops to normal (~0). Then if I unplug the monitor, the CPU load will remain low for approximately 7 seconds, then rise to the level before. Created attachment 145826 [details] [review] Fix to avoid reentering HPD poll detect Could you give a try to the attached patch? Yes. I have applied the patch to linux 5.3.7.arch1-2 and it fixes the issue. (In reply to Val Kulkov from comment #7) > I am attaching a zip archive with two dmesg outputs. *cringe* please always prefer attaching stuff plain text, one file per attachment. The patch successfully fixes the issue: tested on linux-drm-tip-git 5.4.865162.dd5bccfa3b5d-1. Fixed merged to drm-tip: commit a8ddac7c9f06a12227a4f5febd1cbe0575a33179 Author: Imre Deak <imre.deak@intel.com> Date: Mon Oct 28 20:15:17 2019 +0200 drm/i915: Avoid HPD poll detect triggering a new detect cycle Thanks for the reports and the accurate initial triaging effort! |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.