Bug 112125 - [regression, bisected] kworker consumes 100% of one CPU core after kernel upgrade from 5.2.* to 5.3.*
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL: https://bbs.archlinux.org/viewtopic.php?pid=1869944
Whiteboard: Triaged, ReadyForDev
Keywords: bisected, regression
Depends on:
Blocks:
 
Reported: 2019-10-25 02:25 UTC by Val Kulkov
Modified: 2019-10-30 14:44 UTC
CC List: 4 users

See Also:
i915 platform: BSW/CHT
i915 features: display/Other


Attachments
2 dmesg outputs: last good and first bad (305.70 KB, application/zip)
2019-10-25 12:07 UTC, Val Kulkov
Fix to avoid reentering HPD poll detect (3.04 KB, patch)
2019-10-26 19:43 UTC, Imre Deak

Description Val Kulkov 2019-10-25 02:25:01 UTC
A kernel upgrade from 5.2.* to 5.3.* results in 100% usage of one of the available CPU cores. Comparing the output of /proc/interrupts before and after the upgrade (one way to capture this is sketched after the list) reveals that:

- the number of i915 interrupts increases by a factor of 59,
- the number of local timer interrupts increases by a factor of 24, and
- the number of rescheduling interrupts increases by a factor of 6.

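For reference, a minimal sketch of how such a comparison can be captured (the 60-second interval is arbitrary; LOC and RES are the local-timer and rescheduling counter lines in /proc/interrupts):

grep -E 'i915|LOC:|RES:' /proc/interrupts > /tmp/irq.before
sleep 60
grep -E 'i915|LOC:|RES:' /proc/interrupts > /tmp/irq.after
diff /tmp/irq.before /tmp/irq.after
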
Bisecting the kernel between v5.2 and v5.3 identifies commit 6cfe7ec02e854278fb341e62db54d49a2b199c62 as the culprit: https://github.com/torvalds/linux/commit/6cfe7ec02e854278fb341e62db54d49a2b199c62

Full details are available at https://bbs.archlinux.org/viewtopic.php?pid=1869944
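
For anyone retracing the bisect, it followed the usual pattern between the two releases (a sketch; each step requires building and booting the candidate kernel):

git bisect start
git bisect bad v5.3
git bisect good v5.2
# build and boot the kernel git checks out, watch the kworker load, then:
git bisect good    # or: git bisect bad
# repeat until git reports the first bad commit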
Comment 1 Val Kulkov 2019-10-25 02:33:19 UTC
For greater clarity:

ad5125d6ef2de22212de82debd0baaecdaa7e7cd is the last good commit, and
6cfe7ec02e854278fb341e62db54d49a2b199c62 is the first bad commit.
Comment 2 Val Kulkov 2019-10-25 04:52:36 UTC
The regression was present in the mainline kernel as of 2019-10-24.
Comment 3 Lakshmi 2019-10-25 07:49:39 UTC
@Imre, Is it worth trying to verify this issue with drmtip?

(In reply to Val Kulkov from comment #2)
> The regression was present in the mainline kernel as of 2019-10-24.
Val, can you attach the dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ?
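
For reference, one way to set these parameters persistently on a GRUB-based system (a sketch; paths are distro defaults and may differ):

# add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... drm.debug=0x1e log_buf_len=4M"
grub-mkconfig -o /boot/grub/grub.cfg
reboot
# after booting the kernel under test:
dmesg > dmesg-$(uname -r).txt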
Comment 4 Imre Deak 2019-10-25 08:53:27 UTC
(In reply to Lakshmi from comment #3)
> @Imre, Is it worth trying to verify this issue with drmtip?

Yes, it would be helpful to try drmtip too.

> (In reply to Val Kulkov from comment #2)
> > The regression was present in the mainline kernel as of 2019-10-24.
> Val, can you attach the dmesg from boot with kernel parameters
> drm.debug=0x1e log_buf_len=4M ?

Also could you let me know if the issue happens already after boot up, or only after some further steps?
Comment 5 Jani Nikula 2019-10-25 08:57:36 UTC
(In reply to Lakshmi from comment #3)
> @Imre, Is it worth trying to verify this issue with drmtip?
> 
> (In reply to Val Kulkov from comment #2)
> > The regression was present in the mainline kernel as of 2019-10-24.
> Val, can you attach the dmesg from boot with kernel parameters
> drm.debug=0x1e log_buf_len=4M ?

Please do so for both the last good and the first bad commits.

Imre, any ideas why this would happen? I think we probably need to revert and get back to the drawing board with this one.
Comment 6 Imre Deak 2019-10-25 09:11:25 UTC
(In reply to Jani Nikula from comment #5)
> (In reply to Lakshmi from comment #3)
> > @Imre, Is it worth trying to verify this issue with drmtip?
> > 
> > (In reply to Val Kulkov from comment #2)
> > > The regression was present in the mainline kernel as of 2019-10-24.
> > Val, can you attach the dmesg from boot with kernel parameters
> > drm.debug=0x1e log_buf_len=4M ?
> 
> Please do so for both the last good and the first bad commits.
> 
> Imre, any ideas why this would happen? I think we probably need to revert
> and get back to the drawing board with this one.

I can tell more if we get the logs. Reverting will be difficult because the ICL TypeC functionality depends on it, so I'd like to first check the root cause.
Comment 7 Val Kulkov 2019-10-25 12:07:38 UTC
Created attachment 145813 [details]
2 dmesg outputs: last good and first bad

(In reply to Lakshmi from comment #3)
> Val, can you attach the dmesg from boot with kernel parameters
> drm.debug=0x1e log_buf_len=4M ?

Kernel parameters "drm.debug=0x1e log_buf_len=4M" have been added. Both the last good and first bad kernels have been tested. I am attaching a zip archive with two dmesg outputs.
Comment 8 Val Kulkov 2019-10-25 12:15:59 UTC
(In reply to Imre Deak from comment #4)
> 
> Also could you let me know if the issue happens already after boot up, or
> only after some further steps?

This is a headless NUC-like unit with no X display server. The issue with a kworker using 100% of a CPU core appears right away. Not sure if the following will help, but here it is:

=== first bad kernel ===
[root@nuc2 ~]# uptime
 07:52:40 up 0 min,  1 user,  load average: 1.80, 0.50, 0.17
[root@nuc2 ~]# uptime
 07:53:28 up 1 min,  1 user,  load average: 1.72, 0.67, 0.25
[root@nuc2 ~]# uptime
 07:57:55 up 5 min,  1 user,  load average: 1.32, 1.10, 0.55
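
One way to see what the busy kworker is actually doing (a sketch; <PID> is a placeholder to substitute, and reading /proc/<PID>/stack needs root):

# find the kworker with the highest CPU usage:
ps -eo pid,pcpu,comm --sort=-pcpu | grep kworker | head -1
# sample its kernel stack a few times:
cat /proc/<PID>/stack
# or trace which work items keep executing:
echo 1 > /sys/kernel/debug/tracing/events/workqueue/workqueue_execute_start/enable
head /sys/kernel/debug/tracing/trace_pipe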
Comment 9 Val Kulkov 2019-10-25 19:48:47 UTC
Interestingly, an Intel NUC D34010WYK on 5.3.* kernel does not have this issue:

[root@nuc1 ~]# uname -a
Linux nuc1 5.3.7-arch1-1-ARCH #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019 x86_64 GNU/Linux
[root@nuc1 ~]# uptime
 15:47:49 up 26 min,  1 user,  load average: 0.00, 0.00, 0.00
[root@nuc1 ~]# lsmod | grep i915
i915                 2326528  2
i2c_algo_bit           16384  1 i915
drm_kms_helper        212992  1 i915
drm                   516096  3 drm_kms_helper,i915
intel_gtt              24576  1 i915
Comment 10 wangqr 2019-10-25 19:53:13 UTC
I'm also seeing the same issue on a headless router. The CPU is an Intel N3050.
Comment 11 wangqr 2019-10-25 23:25:02 UTC
Since both of us observed this issue on headless machines, I tried plugging a monitor into the machine. After plugging in an HDMI monitor, the CPU load drops to normal (~0). If I then unplug the monitor, the CPU load remains low for approximately 7 seconds, then rises back to its previous level.
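
A sketch of how the load change can be correlated with DRM hotplug events while plugging and unplugging (assumes udev is available):

# in one terminal, watch kernel uevents on the drm subsystem:
udevadm monitor --kernel --subsystem-match=drm
# in another, watch per-second CPU load:
vmstat 1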
Comment 12 Imre Deak 2019-10-26 19:43:31 UTC
Created attachment 145826 [details] [review]
Fix to avoid reentering HPD poll detect

Could you give the attached patch a try?
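
The usual apply-and-rebuild cycle, sketched with a hypothetical filename for the attachment:

cd linux-5.3.7
patch -p1 < ../hpd-poll-fix.patch    # hypothetical name for attachment 145826
make -j"$(nproc)" && make modules_install install
# reboot into the patched kernel and re-check the kworker load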
Comment 13 wangqr 2019-10-26 23:21:30 UTC
Yes. I have applied the patch to linux 5.3.7.arch1-2 and it fixes the issue.
Comment 14 Jani Nikula 2019-10-28 11:38:50 UTC
(In reply to Val Kulkov from comment #7)
> I am attaching a zip archive with two dmesg outputs.

*cringe* Please always attach things as plain text, one file per attachment.
Comment 15 Val Kulkov 2019-10-28 17:31:43 UTC
The patch successfully fixes the issue: tested on linux-drm-tip-git 5.4.865162.dd5bccfa3b5d-1.
Comment 16 Imre Deak 2019-10-30 14:44:40 UTC
Fix merged to drm-tip:

commit a8ddac7c9f06a12227a4f5febd1cbe0575a33179
Author: Imre Deak <imre.deak@intel.com>
Date:   Mon Oct 28 20:15:17 2019 +0200

    drm/i915: Avoid HPD poll detect triggering a new detect cycle

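To check whether a given kernel tree already contains the fix (a sketch):

git log --oneline --grep='Avoid HPD poll detect triggering a new detect cycle'
# or, once the commit is reachable from a tag:
git describe --contains a8ddac7c9f06
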
Thanks for the reports and the accurate initial triaging effort!

