Bug 112315 - 5.3.11 regression: No RC6 on Kaby Lake
Summary: 5.3.11 regression: No RC6 on Kaby Lake
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords: bisected, regression
Depends on:
Blocks:
 
Reported: 2019-11-18 09:47 UTC by Tomas Janousek
Modified: 2019-11-29 19:48 UTC
CC List: 6 users

See Also:
i915 platform: BDW, KBL, SKL
i915 features: GEM/Other


Attachments
dmesg (162.08 KB, text/plain)
2019-11-18 10:10 UTC, Tomas Janousek
/sys/class/drm/card0/error after a GPU hang (4.75 KB, text/plain)
2019-11-25 19:28 UTC, Michael Marley

Description Tomas Janousek 2019-11-18 09:47:39 UTC
"drm/i915/gen8+: Add RC6 CTX corruption WA" (d4360736a7c0a6326e3bbdf7d41181f6ed03d9a6) in 5.3 stable broke RC6 on my Kaby Lake ThinkPad 25 (T470 equiv). This prevents the CPU package from entering package C-states. Happens every time, no suspend/resume needed. The commit message suggests suspend/resume may actually help, but that's not the case here either. No mention of "RC6 context corrupted" in dmesg, but perhaps I need to set drm.debug?

Reverting the commit fixes the issue.
Comment 1 Tomas Janousek 2019-11-18 10:10:07 UTC
Created attachment 145988 [details]
dmesg

One additional observation: it's okay (nearly 100% in rc6 according to powertop) until I start Xorg. Then it's 100% powered on, 0% rc6.

Attaching dmesg | grep drm with drm.debug=0xe. Xorg was started at 10:57:04.
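For a quick check without powertop, the RC6 share of a time window can be estimated from two readings of the driver's residency counter. This is a minimal sketch, assuming the counter is exposed in milliseconds at /sys/class/drm/card0/power/rc6_residency_ms (the path may differ between kernel versions and card indices); the helper names are made up for illustration.

```python
import time
from pathlib import Path

# Assumed location of the i915 RC6 residency counter (milliseconds);
# the path may differ depending on kernel version and card index.
RC6_COUNTER = Path("/sys/class/drm/card0/power/rc6_residency_ms")


def rc6_percent(before_ms, after_ms, elapsed_s):
    """Return the share of the elapsed wall time spent in RC6, in percent."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return 100.0 * (after_ms - before_ms) / (elapsed_s * 1000.0)


def sample(interval_s=10.0, counter=RC6_COUNTER):
    """Read the counter twice, interval_s apart, and return the RC6 share."""
    t0 = time.monotonic()
    before = int(counter.read_text())
    time.sleep(interval_s)
    after = int(counter.read_text())
    return rc6_percent(before, after, time.monotonic() - t0)
```

Calling sample() once before starting Xorg and once afterwards should show the drop described above, provided the counter exists at the assumed path.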
Comment 2 Chris Wilson 2019-11-18 10:10:54 UTC
Correct; that patch disables rc6 while active to prevent catastrophe. And yes, no rc6 is itself pretty catastrophic.

Could you please do something like:

$ perf stat -a -x, -r 1 \
	-e "power/energy-pkg/" \
	-e "power/energy-cores/" \
	-e "power/energy-gpu/" \
	-e "i915/actual-frequency/" \
	-e "i915/rc6-residency/" \
	-e "i915/rcs0-busy/" \
	-e "i915/bcs0-busy/" \
	-e "i915/vcs0-busy/" \
  sleep 300

while you do your normal activities, and report before/after? (Trying to do the same activity in each sample.)
Comment 3 Chris Wilson 2019-11-18 10:11:48 UTC
If you feel daring, you can try https://patchwork.freedesktop.org/series/69591/
Comment 4 Tomas Janousek 2019-11-18 12:46:40 UTC
good (5.3.11 + revert d4360736a7c0a6326e3bbdf7d41181f6ed03d9a6):

328,85,Joules,power/energy-pkg/,299993660112,100,00,,
65,17,Joules,power/energy-cores/,299993673518,100,00,,
26,13,Joules,power/energy-gpu/,299993681101,100,00,,
91077,MHz,i915/actual-frequency/,299993685777,100,00,,
54314227200,ns,i915/rc6-residency/,299993692616,100,00,,
1944679051,ns,i915/rcs0-busy/,299993699743,100,00,,
0,ns,i915/bcs0-busy/,299993706507,100,00,,
0,ns,i915/vcs0-busy/,299993710255,100,00,,

bad (5.3.11):

387,82,Joules,power/energy-pkg/,299995076940,100,00,,
73,07,Joules,power/energy-cores/,299995088838,100,00,,
63,22,Joules,power/energy-gpu/,299995095576,100,00,,
91209,MHz,i915/actual-frequency/,299995099867,100,00,,
0,ns,i915/rc6-residency/,299995106772,100,00,,
966657080,ns,i915/rcs0-busy/,299995113918,100,00,,
0,ns,i915/bcs0-busy/,299995120940,100,00,,
0,ns,i915/vcs0-busy/,299995125062,100,00,,

"normal activities" being "screensaver and walk away", but I think that's a good approximation of my normal GPU activity (redrawing the terminal a couple times per second).
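The numbers above were produced with `-x,` under a locale that uses a decimal comma, so the value field collides with the CSV separator. A rough sketch of a tolerant parser follows, deriving RC6 residency and average package power from the figures posted above. The function names are made up for illustration, and it assumes the field after the event name is the counter run time in nanoseconds.

```python
GOOD = """\
328,85,Joules,power/energy-pkg/,299993660112,100,00,,
54314227200,ns,i915/rc6-residency/,299993692616,100,00,,
"""

BAD = """\
387,82,Joules,power/energy-pkg/,299995076940,100,00,,
0,ns,i915/rc6-residency/,299995106772,100,00,,
"""


def parse_perf_csv(text):
    """Parse `perf stat -x,` lines whose values use a decimal comma.

    Returns {event: (value, unit, runtime_ns)}.
    """
    events = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.strip().split(",")
        # The value occupies one field ("0") or two ("328", "85");
        # the first non-numeric field is the unit.
        unit_idx = next(i for i, f in enumerate(fields) if not f.isdigit())
        value = float(".".join(fields[:unit_idx]))
        unit, event, runtime_ns = fields[unit_idx:unit_idx + 3]
        events[event] = (value, unit, int(runtime_ns))
    return events


def summarize(events):
    """Derive RC6 residency (percent) and mean package power (watts)."""
    rc6_ns, _, rc6_runtime = events["i915/rc6-residency/"]
    pkg_j, _, pkg_runtime = events["power/energy-pkg/"]
    return {
        "rc6_percent": 100.0 * rc6_ns / rc6_runtime,
        "pkg_watts": pkg_j / (pkg_runtime / 1e9),
    }


for label, text in (("good", GOOD), ("bad", BAD)):
    print(label, summarize(parse_perf_csv(text)))
```

Running perf under `LC_ALL=C` would avoid the ambiguity in the first place, since the value would then use a decimal point.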

Not sure I feel daring enough to try those patches. Am I supposed to be able to apply that to 5.3.11 or perhaps compile drm-tip + that as a module for 5.3?
Comment 5 Chris Wilson 2019-11-18 13:17:30 UTC
(In reply to Tomas Janousek from comment #4)
> Not sure I feel daring enough to try those patches. Am I supposed to be able
> to apply that to 5.3.11 or perhaps compile drm-tip + that as a module for
> 5.3?

It's based on our 5.5-tree at present, so, you would have to compile the whole kernel (just use your distro /boot/config-`uname -r`), and it only attempts to enter rc6 faster after activity:

https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315

There is still a dependency on the background worker to pick up the pieces if userspace is completely idle, so we need to think of ways of running that more often, cheaply -- kicking it off after a completion event? Maybe tie it into only if rc6 is disabled.

Hmm, I wonder if we can use something like task_work so that we clean up after userspace on a process switch.
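To make the trade-off concrete, here is a toy model (not the driver's actual logic) of why the retirement timing matters. It assumes the GPU is kept out of RC6 from the start of a burst until that burst's request has been retired, and compares retiring at completion with retiring only on a periodic background tick. All names and numbers are invented for illustration.

```python
import math


def rc6_residency(bursts, window_ms, retire_period_ms=None):
    """Return the fraction of the window spent in RC6 under the toy model.

    bursts: list of (start_ms, duration_ms) GPU work items.
    retire_period_ms: None retires each request at completion; otherwise
    requests are retired at the next multiple of the period.
    """
    awake = []  # merged [begin, end] intervals during which RC6 is blocked
    for start, duration in sorted(bursts):
        done = start + duration
        if retire_period_ms is None:
            retired = done
        else:
            retired = math.ceil(done / retire_period_ms) * retire_period_ms
        retired = min(retired, window_ms)
        if awake and start <= awake[-1][1]:
            # Overlaps the previous awake interval: extend it.
            awake[-1][1] = max(awake[-1][1], retired)
        else:
            awake.append([start, retired])
    awake_ms = sum(end - begin for begin, end in awake)
    return 1.0 - awake_ms / window_ms


# Hypothetical workload: a 2 ms redraw every 500 ms for 10 seconds.
bursts = [(t, 2) for t in range(0, 10_000, 500)]
for period in (None, 100, 1000):
    print(period, rc6_residency(bursts, 10_000, period))
```

In this model a slow background tick can keep the GPU awake continuously even though it is almost always idle, which is the motivation for retiring sooner after activity.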
Comment 6 Tomas Janousek 2019-11-18 13:31:29 UTC
Oh, okay. I'm not sure I want to be running 5.4-rc on my daily driver, but if time allows, I might at least give it a try and report how it behaves.
Comment 7 Michael Marley 2019-11-19 21:21:13 UTC
I am having this same problem on a Broadwell laptop and a handful of various Skylake systems, so I decided to try https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315 (or, precisely, that merged on top of 5.4-rc8 from Linus's tree, which merged cleanly).

While it does allow my systems to reach RC6, it doesn't really seem to make a meaningful difference in power consumption.  It spends <=10% of the time in RC6 if I have any Firefox windows open, for example, even if Firefox isn't actually doing anything.  5.4-rc7 (or 5.4-rc8 with the DoS patch reverted) would have >=95% RC6 under the same conditions.
Comment 8 Tomas Janousek 2019-11-19 21:34:50 UTC
Michael, you might want to give https://patchwork.freedesktop.org/series/69647/ a try, notably https://patchwork.freedesktop.org/patch/341449/?series=69647&rev=2.

(I still didn't get to it but it looks like it might help a bit.)
Comment 9 Chris Wilson 2019-11-19 21:44:36 UTC
That branch was so yesterday; I've just updated it with a more aggressive variant, [mostly] posted to intel-gfx@:

https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug112315&id=dc3a7033dab5ceca5ce43ae09d771951e71a904d
Comment 10 Michael Marley 2019-11-20 03:44:07 UTC
I just tried again with the more aggressive variant and now I am seeing numbers indistinguishable from before the DoS fix.  Thanks!
Comment 11 Michael Marley 2019-11-20 15:35:46 UTC
It looks from https://patchwork.freedesktop.org/patch/341449/?series=69647&rev=2 as if there are still problems with the current approach, and something else in that DRM tree seems to be breaking HDMI audio for me, so I'm going to have to switch back to 5.4-rc8 with the DoS commit reverted. If you have something new that needs testing, I can definitely do that, though.
Comment 12 dmummenschanz@web.de 2019-11-20 17:14:39 UTC
Any chance you guys will add a kernel parameter to disable this commit and bring back "unsafe" RC6?
Comment 13 David GF 2019-11-24 01:41:03 UTC
Any update on this? It seems 5.3.11 has hit Fedora 30 and 31 stable, so this is affecting all Broadwell and Skylake systems (particularly laptops) that run it.
I temporarily reverted to 5.3.8, but it seems the next stable version for Fedora 30/31 is going to be 5.3.12, which makes me ask: does that version include any of the fixes/reverts for this issue?

Thanks!
Comment 14 Tim Richardson 2019-11-24 12:23:06 UTC
It was marked as a security issue and was back-ported by most/all distributions. I think you can assume this will be in all future kernels; it would be a brave distribution that reverted a CVE fix. I hope they are just as fast with the eventual fix.


I hate the power impact; I like to feel OK with 6 hours away from power, since that happens to me a few times a week. Disabling it is a one-line hack to drivers/gpu/drm/i915/i915_drv.h, e.g.

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e7b7c5159378..9dd001bf96e6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2295,7 +2295,7 @@ IS_SUBPLATFORM(const struct drm_i915_private *i915,
 #define HAS_BROKEN_CS_TLB(dev_priv)    (IS_I830(dev_priv) || IS_I845G(dev_priv))
 
 #define NEEDS_RC6_CTX_CORRUPTION_WA(dev_priv)  \
-       (IS_BROADWELL(dev_priv) || IS_GEN(dev_priv, 9))
+       (IS_BROADWELL(dev_priv) || IS_GEN(dev_priv, 999999))
 
 /* WaRsDisableCoarsePowerGating:skl,cnl */
 #define NEEDS_WaRsDisableCoarsePowerGating(dev_priv) \
Comment 15 David GF 2019-11-24 13:27:59 UTC
What's the CVE number?
Comment 16 Tim Richardson 2019-11-24 21:41:51 UTC
This is from the Debian changelog, so I suppose it is this one:
 * [x86] i915: Mitigate local privilege escalation on gen9 (CVE-2019-0155):
Comment 17 Tim Richardson 2019-11-24 22:16:44 UTC
Sorry, I think it is this one: CVE-2019-0154
Comment 18 Chris Wilson 2019-11-25 14:00:19 UTC
Step 1: 4f88f8747fa4 ("drm/i915/gt: Schedule request retirement when timeline idles")
Comment 19 Chris Wilson 2019-11-25 14:52:58 UTC
https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315
contains a backport of softrc6 for v5.4
Comment 20 Michael Marley 2019-11-25 19:14:03 UTC
I tried this on a Skylake system and RC6 does work again.  However, on the first boot with it, the screen locked up completely and I got the output below.  I rebooted again and I haven't seen the problem again yet.

Nov 25 13:58:15 D10a329 kernel: [   75.501215] BUG: unable to handle page fault for address: 0000000000002330
Nov 25 13:58:15 D10a329 kernel: [   75.501218] #PF: supervisor write access in kernel mode
Nov 25 13:58:15 D10a329 kernel: [   75.501219] #PF: error_code(0x0002) - not-present page
Nov 25 13:58:15 D10a329 kernel: [   75.501220] PGD 0 P4D 0 
Nov 25 13:58:15 D10a329 kernel: [   75.501223] Oops: 0002 [#1] PREEMPT SMP PTI
Nov 25 13:58:15 D10a329 kernel: [   75.501225] CPU: 1 PID: 972 Comm: Xorg Tainted: G     U            5.4.0-050400-lowlatency #201911251228
Nov 25 13:58:15 D10a329 kernel: [   75.501226] Hardware name: LENOVO 10FLS33C04/30D0, BIOS FWKTA5A   09/19/2019
Nov 25 13:58:15 D10a329 kernel: [   75.501263] RIP: 0010:gen8_emit_flush_render+0x186/0x1b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501265] Code: 70 00 00 48 3d 00 f0 ff ff 0f 86 79 ff ff ff e9 28 ff ff ff be 0c 00 00 00 e8 36 70 00 00 48 3d 00 f0 ff ff 0f 87 12 ff ff ff <48> c7 40 08 00 00 00 00 48 83 c0 18 48 c7 40 f8 00 00 00 00 48 c7
Nov 25 13:58:15 D10a329 kernel: [   75.501266] RSP: 0018:ffffac1d80917a10 EFLAGS: 00010207
Nov 25 13:58:15 D10a329 kernel: [   75.501268] RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90
Nov 25 13:58:15 D10a329 kernel: [   75.501269] RDX: 0000000000002358 RSI: 00000000000000e0 RDI: ffff9915edff9200
Nov 25 13:58:15 D10a329 kernel: [   75.501270] RBP: ffffac1d80917a20 R08: 0000000000000110 R09: ffff991628db69b0
Nov 25 13:58:15 D10a329 kernel: [   75.501271] R10: 000000000000a000 R11: ffff991625d95b00 R12: 0000000001144c1c
Nov 25 13:58:15 D10a329 kernel: [   75.501272] R13: ffff9916256e6800 R14: 0000000000000cc0 R15: ffff99162a5c36c0
Nov 25 13:58:15 D10a329 kernel: [   75.501273] FS:  00007f8db6c86a80(0000) GS:ffff99162ea80000(0000) knlGS:0000000000000000
Nov 25 13:58:15 D10a329 kernel: [   75.501274] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 25 13:58:15 D10a329 kernel: [   75.501275] CR2: 0000000000002330 CR3: 000000042608e001 CR4: 00000000003606e0
Nov 25 13:58:15 D10a329 kernel: [   75.501276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [   75.501277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 25 13:58:15 D10a329 kernel: [   75.501278] Call Trace:
Nov 25 13:58:15 D10a329 kernel: [   75.501305]  execlists_request_alloc+0x4a/0x140 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501333]  __i915_request_create+0x212/0x270 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501360]  i915_request_create+0x7b/0xd0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501387]  i915_gem_do_execbuffer+0x6d3/0xc80 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501411]  ? irq_enable.part.0+0x3c/0x40 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501415]  ? dma_fence_remove_callback+0x49/0x60
Nov 25 13:58:15 D10a329 kernel: [   75.501441]  ? i915_request_wait+0x1d5/0x3d0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501466]  ? irq_execute_cb+0x30/0x30 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501469]  ? __kmalloc_node+0x24b/0x330
Nov 25 13:58:15 D10a329 kernel: [   75.501493]  i915_gem_execbuffer2_ioctl+0x1db/0x3c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501517]  ? i915_gem_busy_ioctl+0x88/0x1e0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501542]  ? i915_gem_madvise_ioctl+0x176/0x2b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501566]  ? i915_gem_execbuffer_ioctl+0x2c0/0x2c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501578]  drm_ioctl_kernel+0xae/0xf0 [drm]
Nov 25 13:58:15 D10a329 kernel: [   75.501588]  drm_ioctl+0x234/0x3d0 [drm]
Nov 25 13:58:15 D10a329 kernel: [   75.501614]  ? i915_gem_execbuffer_ioctl+0x2c0/0x2c0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501617]  do_vfs_ioctl+0x405/0x660
Nov 25 13:58:15 D10a329 kernel: [   75.501620]  ? __fget+0x77/0xa0
Nov 25 13:58:15 D10a329 kernel: [   75.501621]  ksys_ioctl+0x67/0x90
Nov 25 13:58:15 D10a329 kernel: [   75.501623]  __x64_sys_ioctl+0x1a/0x20
Nov 25 13:58:15 D10a329 kernel: [   75.501626]  do_syscall_64+0x57/0x190
Nov 25 13:58:15 D10a329 kernel: [   75.501629]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 25 13:58:15 D10a329 kernel: [   75.501630] RIP: 0033:0x7f8db6fe467b
Nov 25 13:58:15 D10a329 kernel: [   75.501632] Code: 0f 1e fa 48 8b 05 15 28 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 27 0d 00 f7 d8 64 89 01 48
Nov 25 13:58:15 D10a329 kernel: [   75.501633] RSP: 002b:00007ffd325c02d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 25 13:58:15 D10a329 kernel: [   75.501635] RAX: ffffffffffffffda RBX: 00007ffd325c0320 RCX: 00007f8db6fe467b
Nov 25 13:58:15 D10a329 kernel: [   75.501635] RDX: 00007ffd325c0320 RSI: 0000000040406469 RDI: 000000000000000e
Nov 25 13:58:15 D10a329 kernel: [   75.501636] RBP: 0000000040406469 R08: 000055c491fbc790 R09: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [   75.501637] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c491f797c0
Nov 25 13:58:15 D10a329 kernel: [   75.501638] R13: 000000000000000e R14: ffffffffffffffff R15: 00007f8db65cce08
Nov 25 13:58:15 D10a329 kernel: [   75.501640] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc md4 cmac nls_utf8 cifs libarc4 fscache libdes overlay snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio binfmt_misc intel_rapl_msr nls_iso8859_1 intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass snd_hda_intel snd_intel_nhlt snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib mc snd_hwdep crct10dif_pclmul snd_pcm crc32_pclmul snd_seq_midi ghash_clmulni_intel snd_seq_midi_event snd_rawmidi mei_hdcp i915 snd_seq aesni_intel drm_kms_helper crypto_simd snd_seq_device cryptd snd_timer glue_helper intel_cstate hid_plantronics input_leds intel_rapl_perf drm snd wmi_bmof joydev i2c_algo_bit intel_wmi_thunderbolt mei_me fb_sys_fops syscopyarea soundcore sysfillrect sysimgblt mei acpi_pad mac_hid nct6683
Nov 25 13:58:15 D10a329 kernel: [   75.501665]  coretemp parport_pc ppdev lp parport iTCO_wdt iTCO_vendor_support ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid e1000e i2c_i801 wmi ahci libahci video
Nov 25 13:58:15 D10a329 kernel: [   75.501675] CR2: 0000000000002330
Nov 25 13:58:15 D10a329 kernel: [   75.501678] ---[ end trace 7ed3c4bcf4278660 ]---
Nov 25 13:58:15 D10a329 kernel: [   75.501703] RIP: 0010:gen8_emit_flush_render+0x186/0x1b0 [i915]
Nov 25 13:58:15 D10a329 kernel: [   75.501705] Code: 70 00 00 48 3d 00 f0 ff ff 0f 86 79 ff ff ff e9 28 ff ff ff be 0c 00 00 00 e8 36 70 00 00 48 3d 00 f0 ff ff 0f 87 12 ff ff ff <48> c7 40 08 00 00 00 00 48 83 c0 18 48 c7 40 f8 00 00 00 00 48 c7
Nov 25 13:58:15 D10a329 kernel: [   75.501706] RSP: 0018:ffffac1d80917a10 EFLAGS: 00010207
Nov 25 13:58:15 D10a329 kernel: [   75.501707] RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90
Nov 25 13:58:15 D10a329 kernel: [   75.501708] RDX: 0000000000002358 RSI: 00000000000000e0 RDI: ffff9915edff9200
Nov 25 13:58:15 D10a329 kernel: [   75.501709] RBP: ffffac1d80917a20 R08: 0000000000000110 R09: ffff991628db69b0
Nov 25 13:58:15 D10a329 kernel: [   75.501710] R10: 000000000000a000 R11: ffff991625d95b00 R12: 0000000001144c1c
Nov 25 13:58:15 D10a329 kernel: [   75.501711] R13: ffff9916256e6800 R14: 0000000000000cc0 R15: ffff99162a5c36c0
Nov 25 13:58:15 D10a329 kernel: [   75.501712] FS:  00007f8db6c86a80(0000) GS:ffff99162ea80000(0000) knlGS:0000000000000000
Nov 25 13:58:15 D10a329 kernel: [   75.501713] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 25 13:58:15 D10a329 kernel: [   75.501714] CR2: 0000000000002330 CR3: 000000042608e001 CR4: 00000000003606e0
Nov 25 13:58:15 D10a329 kernel: [   75.501715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 25 13:58:15 D10a329 kernel: [   75.501716] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
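The oops can be cross-checked by hand. The marked faulting bytes `<48> c7 40 08 00 00 00 00` encode `movq $0x0,0x8(%rax)`, so the fault address in CR2 should equal RAX + 8, and error code 0x0002 should decode to a supervisor-mode write to a not-present page. The sketch below does that arithmetic on lines copied from the log; the helper names are made up for illustration.

```python
import re

# x86 page-fault error code bits (hardware-defined).
PF_PRESENT = 1 << 0  # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1    # 0: read access, 1: write access
PF_USER = 1 << 2     # 0: supervisor mode, 1: user mode


def decode_pf_error(code):
    """Translate the low three bits of a page-fault error code."""
    return {
        "cause": "protection violation" if code & PF_PRESENT else "not-present page",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
    }


def register(oops, name):
    """Return the first value logged for a register such as RAX or CR2."""
    match = re.search(rf"\b{name}: ([0-9a-f]{{16}})", oops)
    return int(match.group(1), 16)


OOPS = """\
#PF: error_code(0x0002) - not-present page
RAX: 0000000000002328 RBX: 00000000fffff080 RCX: 0000000000003f90
CR2: 0000000000002330 CR3: 000000042608e001 CR4: 00000000003606e0
"""

code = int(re.search(r"error_code\((0x[0-9a-f]+)\)", OOPS).group(1), 16)
print(decode_pf_error(code))
# Store to 0x8(%rax): the fault address should be RAX + 8.
print(hex(register(OOPS, "RAX") + 8), hex(register(OOPS, "CR2")))
```

Both values come out as 0x2330, which suggests the write went through a small, bogus pointer held in RAX rather than through a NULL or freed kernel address.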
Comment 21 Michael Marley 2019-11-25 19:28:40 UTC
Created attachment 146023 [details]
/sys/class/drm/card0/error after a GPU hang

I also just got a GPU hang, the output from which I have attached.
Comment 22 Chris Wilson 2019-11-25 21:36:00 UTC
The individual patches look ok, so it looks like I assumed that v5.4 i915_request_retire() was ready to be called without struct_mutex held. That turns out to be a mistake!

Next iteration at https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315

version https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug112315&id=21234379ea5ae5af001539362c01f0888b4cf81a
Comment 23 Michael Marley 2019-11-26 03:03:24 UTC
Thanks!  So far this one is working well on a Skylake system and a Broadwell system.  I haven't had a chance to test the specific one that was crashing before, but I will have more information on that tomorrow.
Comment 24 Michael Marley 2019-11-26 18:49:30 UTC
Several hours of testing on the computer where I first encountered the crashes and hangs has also been completely problem free.  With the previous patchset, it would have likely crashed 4-5 times during that period, so it looks fixed to me.  Thanks!
Comment 25 Martin Peres 2019-11-29 19:48:31 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/614.

