Bug 102037

Summary:	[KBL][IGT] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:539!
Product:	DRI	Reporter:	Tomi Sarvela <tomi.p.sarvela>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	major
Priority:	medium	CC:	intel-gfx-bugs, matwey.kornilov, tomislav.ivek
Version:	XOrg git
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:	ReadyForDev
i915 platform:	KBL	i915 features:	power/runtime PM

Description Tomi Sarvela 2017-08-04 14:45:16 UTC

Boom on Intel Kaby Lake NUC7i5BNH with IGT-testing. Recent (todays) DRM-Tip with lot of debug on.

Short version:
ickle> first 2, file a bug
ickle> context-switch interrupt when waking up before we even have submitted anything?

<6>[   33.522783] PM: late suspend of devices complete after 16.768 msecs
<6>[   33.554237] PM: noirq suspend of devices complete after 31.445 msecs
<6>[   33.555474] ACPI: Preparing to enter system sleep state S3
<6>[   33.564375] ACPI: EC: event blocked
<6>[   33.564378] ACPI: EC: EC stopped
<6>[   33.564380] PM: Saving platform NVS memory
<6>[   33.564716] Disabling non-boot CPUs ...
<6>[   33.578858] smpboot: CPU 1 is now offline
<6>[   33.591791] smpboot: CPU 2 is now offline
<4>[   33.601468] IRQ 8: no longer affine to CPU3
<4>[   33.601481] IRQ 9: no longer affine to CPU3
<4>[   33.601493] IRQ 126: no longer affine to CPU3
<4>[   33.601499] IRQ 132: no longer affine to CPU3
<6>[   33.602611] smpboot: CPU 3 is now offline
<6>[   33.606640] ACPI: Low-level resume complete
<6>[   33.606831] ACPI: EC: EC started
<6>[   33.606832] PM: Restoring platform NVS memory
<6>[   33.607698] Suspended for 16.777 seconds
<6>[   33.607755] Enabling non-boot CPUs ...
<6>[   33.607803] x86: Booting SMP configuration:
<6>[   33.607804] smpboot: Booting Node 0 Processor 1 APIC 0x2
<4>[   33.609060]  cache: parent cpu1 should not be sleeping
<6>[   33.609563] CPU1 is up
<6>[   33.609587] smpboot: Booting Node 0 Processor 2 APIC 0x1
<4>[   33.610591]  cache: parent cpu2 should not be sleeping
<6>[   33.611006] CPU2 is up
<6>[   33.611029] smpboot: Booting Node 0 Processor 3 APIC 0x3
<4>[   33.611871]  cache: parent cpu3 should not be sleeping
<6>[   33.612319] CPU3 is up
<6>[   33.615531] ACPI: Waking up from system sleep state S3
<6>[   33.685469] PM: noirq resume of devices complete after 25.560 msecs
<7>[   33.685673] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 00
<7>[   33.685722] [drm:intel_power_well_enable [i915]] enabling power well 1
<7>[   33.685768] [drm:intel_power_well_enable [i915]] enabling MISC IO power well
<7>[   33.685824] [drm:skl_init_cdclk [i915]] Sanitizing cdclk programmed by pre-os
<7>[   33.687722] [drm:intel_update_cdclk [i915]] Current CD clock rate: 337500 kHz, VCO: 8100000 kHz, ref: 24000 kHz
<7>[   33.688549] [drm:intel_power_well_enable [i915]] enabling always-on
<7>[   33.688576] [drm:intel_power_well_enable [i915]] enabling DC off
<7>[   33.688604] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 00
<7>[   33.688636] [drm:intel_power_well_enable [i915]] enabling power well 2
<7>[   33.688674] [drm:intel_power_well_enable [i915]] enabling DDI A/E IO power well
<7>[   33.688702] [drm:intel_power_well_enable [i915]] enabling DDI B IO power well
<7>[   33.688729] [drm:intel_power_well_enable [i915]] enabling DDI C IO power well
<7>[   33.688757] [drm:intel_power_well_enable [i915]] enabling DDI D IO power well
<6>[   33.691248] PM: early resume of devices complete after 5.736 msecs
<7>[   33.691480] [drm:intel_opregion_setup [i915]] graphic opregion physical addr: 0x7af8a018
<7>[   33.691562] [drm:intel_opregion_setup [i915]] Public ACPI methods supported
<7>[   33.691612] [drm:intel_opregion_setup [i915]] SWSCI supported
<6>[   33.691762] ACPI: EC: event unblocked
<7>[   33.698713] [drm:intel_opregion_setup [i915]] SWSCI GBDA callbacks 00000cb3, SBCB callbacks 00300483
<7>[   33.698755] [drm:intel_opregion_setup [i915]] ASLE supported
<7>[   33.698793] [drm:intel_opregion_setup [i915]] ASLE extension supported
<7>[   33.698831] [drm:intel_opregion_setup [i915]] Found valid VBT in ACPI OpRegion (Mailbox #4)
<7>[   33.699372] [drm:lspcon_wake_native_aux_ch [i915]] Native AUX CH up, DPCD version: 15.14
<7>[   33.772767] [drm:lspcon_resume [i915]] LSPCON recovering in PCON mode after 73 ms
<7>[   33.773204] [drm:drm_dp_i2c_do_msg] native defer
<7>[   33.774375] [drm:drm_dp_i2c_do_msg] native defer
<7>[   33.775592] [drm:drm_dp_i2c_do_msg] native defer
<7>[   33.776796] [drm:drm_dp_i2c_do_msg] native defer
<7>[   33.778014] [drm:drm_dp_i2c_do_msg] native defer
<7>[   33.778813] [drm:lspcon_wait_mode [i915]] Current LSPCON mode PCON
<7>[   33.779222] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
<7>[   33.779244] [drm:gen8_init_common_ring [i915]] Restarting rcs0:0 from 0x73
<7>[   33.779282] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 17
<7>[   33.779352] [drm:gen8_init_common_ring [i915]] Execlists enabled for bcs0
<4>[   33.779409] ------------[ cut here ]------------
<2>[   33.779410] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:539!
<4>[   33.779412] invalid opcode: 0000 [#1] PREEMPT SMP
<4>[   33.779413] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel e1000e snd_hda_codec ptp snd_hwdep snd_hda_core pps_core snd_pcm mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid
<4>[   33.779429] CPU: 0 PID: 1335 Comm: kworker/u8:50 Tainted: G        W       4.13.0-rc3-CI-CI_DRM_2919+ #1
<4>[   33.779430] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
<4>[   33.779433] Workqueue: events_unbound async_run_entry_fn
<4>[   33.779434] task: ffff88025ef80040 task.stack: ffffc90001e74000
<4>[   33.779454] RIP: 0010:intel_lrc_irq_handler+0x25e/0x500 [i915]
<4>[   33.779455] RSP: 0018:ffff88027ec03f08 EFLAGS: 00010246
<4>[   33.779456] RAX: 0000000000000000 RBX: ffff88026bce4588 RCX: 0000000000000000
<4>[   33.779457] RDX: 0000000080010001 RSI: ffffffff81c9a92b RDI: ffff88026bce42a8
<4>[   33.779458] RBP: ffff88027ec03f30 R08: ffff88026cc30000 R09: 0000000000000000
<4>[   33.779459] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88026bce4590
<4>[   33.779459] R13: 0000000000000000 R14: ffffffff81cfbae7 R15: 0000000000000000
<4>[   33.779460] FS:  0000000000000000(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
<4>[   33.779461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   33.779462] CR2: 000000edb033ad80 CR3: 0000000003e0f000 CR4: 00000000003406f0
<4>[   33.779462] Call Trace:
<4>[   33.779463]  <IRQ>
<4>[   33.779466]  ? tasklet_hi_action+0x93/0x120
<4>[   33.779468]  __do_softirq+0xbb/0x4b0
<4>[   33.779471]  irq_exit+0xa9/0xc0
<4>[   33.779472]  do_IRQ+0x6c/0x130
<4>[   33.779475]  common_interrupt+0x90/0x90
<4>[   33.779476] RIP: 0010:_raw_spin_unlock_irq+0x2d/0x50
<4>[   33.779477] RSP: 0018:ffffc90001e77c30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff6d
<4>[   33.779478] RAX: ffffffffffffffff RBX: ffff88025e030060 RCX: 0000000000000000
<4>[   33.779479] RDX: ffffffffa01d2e9a RSI: 0000000000000001 RDI: 0000000000000001
<4>[   33.779479] RBP: ffffc90001e77c38 R08: 0000000000000019 R09: 0000000000000000
<4>[   33.779480] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025e030060
<4>[   33.779481] R13: ffff88025e030010 R14: 0000000000000000 R15: ffffffff81ccf32f
<4>[   33.779482]  </IRQ>
<4>[   33.779502]  ? intel_engine_reset_breadcrumbs+0x4a/0x60 [i915]
<4>[   33.779521]  intel_engine_reset_breadcrumbs+0x4a/0x60 [i915]
<4>[   33.779540]  gen8_init_common_ring+0x39/0x150 [i915]
<4>[   33.779558]  i915_gem_init_hw+0xce/0x2c0 [i915]
<4>[   33.779571]  i915_pm_restore+0x90/0x190 [i915]
<4>[   33.779584]  i915_pm_resume+0x9/0x10 [i915]
<4>[   33.779586]  pci_pm_resume+0x5f/0x90
<4>[   33.779588]  dpm_run_callback+0x6a/0x310
<4>[   33.779590]  ? pci_pm_freeze+0xe0/0xe0
<4>[   33.779592]  device_resume+0xac/0x1e0
<4>[   33.779593]  ? dpm_watchdog_set+0x60/0x60
<4>[   33.779595]  async_resume+0x18/0x40
<4>[   33.779597]  async_run_entry_fn+0x33/0x160
<4>[   33.779599]  process_one_work+0x21f/0x630
<4>[   33.779601]  worker_thread+0x49/0x3b0
<4>[   33.779603]  kthread+0x10f/0x150
<4>[   33.779604]  ? process_one_work+0x630/0x630
<4>[   33.779605]  ? kthread_create_on_node+0x40/0x40
<4>[   33.779607]  ret_from_fork+0x27/0x40
<4>[   33.779609] Code: 83 e5 08 0f 85 d8 fe ff ff 0f 0b 0f 0b 0f 0b 48 89 cf 4c 89 45 c0 4c 89 55 c8 e8 7e 8c 40 e1 4c 8b 45 c0 4c 8b 55 c8 eb 92 0f 0b <0f> 0b 49 8d 84 24 40 03 00 00 48 83 e2 fc 48 89 45 a8 74 14 8b 
<1>[   33.779658] RIP: intel_lrc_irq_handler+0x25e/0x500 [i915] RSP: ffff88027ec03f08
<4>[   33.779663] ---[ end trace e6fc4c5de8ebe67b ]---

Long version:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2919/shard-kbl4/dmesg-1501855809_Oops_1.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2919/shard-kbl4/dmesg-1501855809_Panic_2.log

Comment 1 Chris Wilson 2017-08-04 14:52:47 UTC

Short story is that we are resuming with requests still in the ELSP:

[   33.779244] [drm:gen8_init_common_ring [i915]] Restarting rcs0:0 from 0x73

We shouldn't be sleeping with ELSP unfinished...

Comment 2 Chris Wilson 2017-08-04 14:55:27 UTC

Ah, we did go to sleep too early:

<3>[   24.870045] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle

<4>[   33.335633] WARN_ON(!intel_engines_are_idle(dev_priv))
<4>[   33.335678] ------------[ cut here ]------------
<4>[   33.335772] WARNING: CPU: 1 PID: 1316 at drivers/gpu/drm/i915/i915_gem.c:4573 i915_gem_suspend+0x112/0x130 [i915]
<4>[   33.335774] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel e1000e snd_hda_codec ptp snd_hwdep snd_hda_core pps_core snd_pcm mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid
<4>[   33.335842] CPU: 1 PID: 1316 Comm: kworker/u8:31 Tainted: G        W       4.13.0-rc3-CI-CI_DRM_2919+ #1
<4>[   33.335846] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
<4>[   33.335856] Workqueue: events_unbound async_run_entry_fn
<4>[   33.335863] task: ffff88026f16a840 task.stack: ffffc90001ddc000
<4>[   33.335951] RIP: 0010:i915_gem_suspend+0x112/0x130 [i915]
<4>[   33.335955] RSP: 0018:ffffc90001ddfc80 EFLAGS: 00010282
<4>[   33.335962] RAX: 000000000000002a RBX: ffff88026cc30000 RCX: 0000000000000000
<4>[   33.335966] RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffffff810ebd71
<4>[   33.335969] RBP: ffffc90001ddfc98 R08: 0000000000000001 R09: 0000000000000000
<4>[   33.335972] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[   33.335976] R13: ffff88026cc380b8 R14: 0000000000000000 R15: ffffffff81ccf32f
<4>[   33.335981] FS:  0000000000000000(0000) GS:ffff88027ec80000(0000) knlGS:0000000000000000
<4>[   33.335985] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   33.335989] CR2: 00007f140af8db04 CR3: 0000000003e0f000 CR4: 00000000003406e0
<4>[   33.335992] Call Trace:
<4>[   33.336066]  i915_pm_suspend+0x81/0x1a0 [i915]
<4>[   33.336077]  pci_pm_suspend+0x73/0x140
<4>[   33.336087]  dpm_run_callback+0x6a/0x310
<4>[   33.336094]  ? pci_pm_resume+0x90/0x90
<4>[   33.336102]  __device_suspend+0xfd/0x380
<4>[   33.336108]  ? check_preempt_curr+0x87/0xb0
<4>[   33.336116]  ? dpm_watchdog_set+0x60/0x60
<4>[   33.336126]  async_suspend+0x1a/0x90
<4>[   33.336134]  async_run_entry_fn+0x33/0x160
<4>[   33.336142]  process_one_work+0x21f/0x630
<4>[   33.336152]  worker_thread+0x49/0x3b0
<4>[   33.336161]  kthread+0x10f/0x150
<4>[   33.336167]  ? process_one_work+0x630/0x630
<4>[   33.336172]  ? kthread_create_on_node+0x40/0x40
<4>[   33.336180]  ret_from_fork+0x27/0x40
<4>[   33.336190] Code: 74 16 48 89 df e8 cf fe ff ff e9 59 ff ff ff 48 83 7a 08 00 74 8b 0f 0b 48 c7 c6 68 a1 2a a0 48 c7 c7 df 7b 29 a0 e8 da 36 f3 e0 <0f> ff eb d3 48 c7 c6 9c 7c 29 a0 48 c7 c7 df 7b 29 a0 e8 c3 36
<4>[   33.336341] ---[ end trace e6fc4c5de8ebe67a ]---

Comment 3 Chris Wilson 2017-08-29 17:18:46 UTC

commit f36325f3789c1cf7f0d795ff180cade25ec3a586
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 26 12:09:34 2017 +0100

    drm/i915: Clear wedged status upon resume
    
    When we wake up from suspend, the device has been powered down and
    should come back afresh. We should be able to safely remove the wedged
    status from the previous session and start afresh.

commit fc692bd31bc9dd17c7cc59abdb514a58964fc2a7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 26 12:09:35 2017 +0100

    drm/i915: Discard the request queue if we fail to sleep before suspend
    
    If we fail to clear the outstanding request queue before suspending,
    mark those requests as lost.

will prevent the BUG after resuming, but leave the underlying issue unresolved. At least it is lessening the severity of the bug.

Comment 4 Elizabeth 2017-10-16 21:09:06 UTC

Sorry if this is SPAM, not sure if this is related to this bug: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3241/shard-kbl6/dmesg-1508176798_Panic_1.log

Comment 5 Chris Wilson 2017-10-16 21:13:10 UTC

(In reply to Elizabeth from comment #4)
> Sorry if this is SPAM, not sure if this is related to this bug:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3241/shard-kbl6/dmesg-
> 1508176798_Panic_1.log

No, that's mmio death on kbl. (Working theory is dmc related.) This here was just suspending too early (marking ourselves idle before the context switch interrupt arrived.)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.