Bug 110943

Summary:	[CI][BAT][guc] igt@(suspend\|s3) - dmesg-warn - Unexpected event: no suitable handler
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Fernando Pacheco <fernando.pacheco>
Status:	RESOLVED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	highest	CC:	intel-gfx-bugs, lakshminarayana.vudum
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	CFL	i915 features:	firmware/guc

Description Martin Peres 2019-06-18 10:35:40 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3168/fi-cfl-guc/igt@gem_exec_suspend@basic-s3.html

<4> [53.143060] ------------[ cut here ]------------
<6> [53.143070] [drm] CT: disabled
<4> [53.143071] Unexpected event: no suitable handler
<4> [53.143275] WARNING: CPU: 7 PID: 0 at drivers/gpu/drm/i915/intel_guc.c:408 intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]
<4> [53.143279] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp mei_hdcp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core e1000e snd_pcm mei_me prime_numbers mei ptp pps_core
<4> [53.143315] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G     U            5.2.0-rc5-CI-CI_DRM_6290+ #1
<4> [53.143319] Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
<4> [53.143465] RIP: 0010:intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]
<4> [53.143473] Code: 48 c7 c7 97 8c 35 a0 e8 f2 b0 e7 e0 0f 0b b8 ed ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c7 f0 14 33 a0 e8 d4 b0 e7 e0 <0f> 0b c3 90 41 55 41 54 55 53 48 89 fd 80 bf da f9 ff ff 0a 0f 86
<4> [53.143477] RSP: 0018:ffffc9000023ced8 EFLAGS: 00010082
<4> [53.143483] RAX: 0000000000000000 RBX: ffff888252450000 RCX: 0000000000000000
<4> [53.143488] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
<4> [53.143492] RBP: ffffc90000800000 R08: 0000000000000000 R09: 0000000000000001
<4> [53.143496] R10: ffffc9000023cf28 R11: 0000000000000000 R12: 0000000000000020
<4> [53.143500] R13: ffffc9000023cf64 R14: 0000000000000000 R15: ffff8882530a4578
<4> [53.143505] FS:  0000000000000000(0000) GS:ffff888266980000(0000) knlGS:0000000000000000
<4> [53.143509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [53.143514] CR2: 00007f78d599f3c0 CR3: 0000000005210004 CR4: 00000000003606e0
<4> [53.143518] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [53.143522] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [53.143526] Call Trace:
<4> [53.143533]  <IRQ>
<4> [53.143629]  gen8_irq_handler+0x95/0xd0 [i915]
<4> [53.143646]  __handle_irq_event_percpu+0x41/0x2d0
<4> [53.143653]  ? handle_irq_event+0x27/0x50
<4> [53.143665]  handle_irq_event_percpu+0x2b/0x70
<4> [53.143674]  handle_irq_event+0x2f/0x50
<4> [53.143683]  handle_edge_irq+0xee/0x1a0
<4> [53.143691]  handle_irq+0x17/0x20
<4> [53.143701]  do_IRQ+0x5e/0x130
<4> [53.143712]  common_interrupt+0xf/0xf
<4> [53.143717]  </IRQ>
<4> [53.143726] RIP: 0010:cpuidle_enter_state+0xae/0x450
<4> [53.143733] Code: 44 00 00 31 ff e8 f2 5d 92 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 78 03 00 00 31 ff e8 1b 40 99 ff e8 b6 43 9d ff fb 45 85 ed <0f> 88 c9 02 00 00 4c 2b 24 24 48 ba cf f7 53 e3 a5 9b c4 20 49 63
<4> [53.143737] RSP: 0018:ffffc900000dfe80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffde
<4> [53.143743] RAX: ffff888264c78040 RBX: ffffffff8229e680 RCX: 0000000000000007
<4> [53.143747] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffffffff8212e3b1
<4> [53.143751] RBP: ffffe8ffffb80b00 R08: 0000000000000000 R09: 0000000000000000
<4> [53.143755] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000c5f9285ea
<4> [53.143759] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000004
<4> [53.143790]  cpuidle_enter+0x24/0x40
<4> [53.143801]  do_idle+0x1f3/0x260
<4> [53.143814]  cpu_startup_entry+0x14/0x20
<4> [53.143824]  start_secondary+0x173/0x1c0
<4> [53.143836]  secondary_startup_64+0xa4/0xb0
<4> [53.143858] irq event stamp: 10752724
<4> [53.143866] hardirqs last  enabled at (10752721): [<ffffffff817c1afa>] cpuidle_enter_state+0xaa/0x450
<4> [53.143872] hardirqs last disabled at (10752722): [<ffffffff810019e0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [53.143881] softirqs last  enabled at (10752724): [<ffffffff810b62f8>] irq_enter+0x58/0x60
<4> [53.143888] softirqs last disabled at (10752723): [<ffffffff810b62dd>] irq_enter+0x3d/0x60
<4> [53.144033] WARNING: CPU: 7 PID: 0 at drivers/gpu/drm/i915/intel_guc.c:408 intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]
<4> [53.144038] ---[ end trace 1b251ac1cfe3f517 ]---

Comment 1 CI Bug Log 2019-06-18 10:36:12 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* GUC: suspend tests - dmesg-warn - Unexpected event: no suitable handler
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3168/fi-cfl-guc/igt@gem_exec_suspend@basic-s3.html

Comment 2 CI Bug Log 2019-06-18 10:38:10 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* GUC: igt@runner@aborted - fail - Previous test: .*(suspend|s3).*
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13196/fi-cfl-guc/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3168/fi-cfl-guc/igt@runner@aborted.html

Comment 3 Daniele Ceraolo Spurio 2019-06-18 18:30:48 UTC

Michal pointed out that in guc_disable_communication() we set the function pointer to the _nop function inside intel_guc_ct_disable() and therefore before guc_disable_interrupts(), so this issue might be caused by a log flush interrupt between those 2 calls.
Simply flipping the order of the 2 calls should fix the warning, but we'll lose the flush event. However, after CTBs are disabled the guc events are logged in the SCRATCH_15 register, so we should be able to get it from there. I'll check with Michal if he has any issues with this approach or any better idea and if he hasn't I'll prepare a patch.

Comment 4 CI Bug Log 2019-06-19 07:29:40 UTC

A CI Bug Log filter associated to this bug has been updated:

{- GUC: suspend tests - dmesg-warn - Unexpected event: no suitable handler -}
{+ GUC: suspend tests / igt@i915_selftest@live_* - dmesg-warn - Unexpected event: no suitable handler +}


  No new failures caught with the new filter

Comment 5 CI Bug Log 2019-06-19 07:30:05 UTC

A CI Bug Log filter associated to this bug has been updated:

{- GUC: igt@runner@aborted - fail - Previous test: .*(suspend|s3).* -}
{+ GUC: igt@runner@aborted - fail - Previous test: .*(suspend|s3|live_).* +}


  No new failures caught with the new filter

Comment 6 Caz.Yokoyama 2019-06-20 14:37:07 UTC

Soon will be resolved/fixed?

Comment 7 Daniele Ceraolo Spurio 2019-06-21 17:19:35 UTC

Patches to fix this are in flight (https://patchwork.freedesktop.org/series/62412/), just a couple of minor comments to address

Comment 8 Chris Wilson 2019-06-21 19:30:34 UTC

commit e29cc1d7e820d662d2d73b35281b70f01a24610b
Author: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Date:   Fri Jun 21 11:21:22 2019 -0700

    drm/i915/guc: reorder enable/disable communication steps
    
    Make sure we always have CT buffers enabled when the interrupts are
    enabled, so we can always handle interrupts from GuC. Also move the
    setting of the guc->send and guc->handler functions to the GuC
    communication control functions for consistency.
    
    The reorder also fixes the onion unwinding of intel_uc_init_hw, because
    guc_enable_communication would've left interrupts enabled when failing
    to enable CTB.
    
    v2: always retunr the result of ctch_enable() in
        intel_guc_ct_enable() (Michal)
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110943
    Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Reviewed-by: Matthew Brost <matthew.brost@intel.com>
    Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190621182123.31368-1-daniele.ceraolospurio@intel.com

Comment 9 Chris Wilson 2019-06-27 16:37:12 UTC

*** Bug 110910 has been marked as a duplicate of this bug. ***

Comment 10 Chris Wilson 2019-07-17 16:34:34 UTC

Happened again in CI_DRM_6499

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6499/fi-icl-dsi/igt@i915_pm_rpm@module-reload.html

<4> [332.218126] Unexpected event: no suitable handler
<4> [332.218288] WARNING: CPU: 4 PID: 0 at drivers/gpu/drm/i915/gt/uc/intel_guc.c:325 intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]
<4> [332.218293] Modules linked in: i915(-) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp ax88179_178a usbnet crct10dif_pclmul mii crc32_pclmul snd_hda_codec e1000e snd_hwdep snd_hda_core ghash_clmulni_intel ptp pps_core snd_pcm mei_me mei prime_numbers [last unloaded: snd_hda_intel]
<4> [332.218322] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G     U            5.2.0-CI-CI_DRM_6499+ #1
<4> [332.218325] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake Y LPDDR4x T4 RVP TLC, BIOS ICLSFWR1.R00.3102.A00.1903052247 03/05/2019
<4> [332.218417] RIP: 0010:intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]
<4> [332.218421] Code: 48 c7 c7 2f 85 98 a0 e8 92 43 85 e0 0f 0b b8 ed ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c7 b0 09 96 a0 e8 74 43 85 e0 <0f> 0b c3 90 41 54 55 48 89 fd 53 48 8b 47 e8 80 b8 da 0b 00 00 0a
<4> [332.218425] RSP: 0018:ffffc900001fcea0 EFLAGS: 00010086
<4> [332.218429] RAX: 0000000000000000 RBX: ffffc90009000000 RCX: 0000000000000000
<4> [332.218432] RDX: 0000000000000007 RSI: 0000000000000000 RDI: 00000000ffffffff
<4> [332.218435] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000001
<4> [332.218438] R10: ffffc900001fce38 R11: 0000000000000000 R12: 0000000000000000
<4> [332.218441] R13: 0000000000000000 R14: 0000000000000019 R15: ffff88827b270000
<4> [332.218445] FS:  0000000000000000(0000) GS:ffff88829be00000(0000) knlGS:0000000000000000
<4> [332.218448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [332.218452] CR2: 00007f50551b8c38 CR3: 000000024a894006 CR4: 0000000000760ee0
<4> [332.218455] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [332.218458] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [332.218461] PKRU: 55555554
<4> [332.218463] Call Trace:
<4> [332.218468]  <IRQ>
<4> [332.218525]  gen11_irq_handler+0x2f5/0x420 [i915]
<4> [332.218540]  __handle_irq_event_percpu+0x41/0x2d0
<4> [332.218544]  ? handle_irq_event+0x27/0x50
<4> [332.218552]  handle_irq_event_percpu+0x2b/0x70
<4> [332.218557]  handle_irq_event+0x2f/0x50
<4> [332.218564]  handle_edge_irq+0xee/0x1a0
<4> [332.218570]  handle_irq+0x17/0x20
<4> [332.218577]  do_IRQ+0x5e/0x130
<4> [332.218584]  common_interrupt+0xf/0xf
<4> [332.218588]  </IRQ>
<4> [332.218594] RIP: 0010:cpuidle_enter_state+0xae/0x450
<4> [332.218598] Code: 44 00 00 31 ff e8 52 87 92 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 78 03 00 00 31 ff e8 7b 69 99 ff e8 76 6d 9d ff fb 45 85 ed <0f> 88 c9 02 00 00 4c 2b 24 24 48 ba cf f7 53 e3 a5 9b c4 20 49 63
<4> [332.218601] RSP: 0018:ffffc9000012be80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdc
<4> [332.218605] RAX: ffff88829a09c040 RBX: ffffffff822a0a60 RCX: 0000000000000004
<4> [332.218608] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffffffff8212d6d1
<4> [332.218611] RBP: ffff88829720d508 R08: 0000000000000000 R09: 0000000000000000
<4> [332.218614] R10: 0000000000000000 R11: 0000000000000000 R12: 0000004d59b7ee90
<4> [332.218616] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000002
<4> [332.218637]  cpuidle_enter+0x24/0x40
<4> [332.218654]  do_idle+0x1f3/0x260
<4> [332.218663]  cpu_startup_entry+0x14/0x20
<4> [332.218669]  start_secondary+0x173/0x1c0
<4> [332.218677]  secondary_startup_64+0xa4/0xb0
<4> [332.218692] irq event stamp: 24479052
<4> [332.218697] hardirqs last  enabled at (24479049): [<ffffffff817bf2ba>] cpuidle_enter_state+0xaa/0x450
<4> [332.218701] hardirqs last disabled at (24479050): [<ffffffff810019f0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [332.218707] softirqs last  enabled at (24479052): [<ffffffff810b6408>] irq_enter+0x58/0x60
<4> [332.218712] softirqs last disabled at (24479051): [<ffffffff810b63ed>] irq_enter+0x3d/0x60
<4> [332.218744] WARNING: CPU: 4 PID: 0 at drivers/gpu/drm/i915/gt/ucb/intel_guc.c:325 intel_guc_to_host_event_handler_nop+0xc/0x10 [i915]

Comment 11 Daniele Ceraolo Spurio 2019-07-17 16:51:53 UTC

Looks like a different issue. I believe the problem here is due to a known issue in the unload path where we call uc_sanitize before we call intel_uc_fini_hw and therefore we hit the !intel_guc_is_loaded(guc) check in the latter, thus skipping the guc_disable_communication(). We saw this issue with guc submission enabled (because we skip the intel_guc_submission_disable call) and Fernando was looking at fixing it, we just didn't realize there was an implication on the non-submission path as well.

Comment 12 Chris Wilson 2019-07-18 09:55:21 UTC

*** Bug 111164 has been marked as a duplicate of this bug. ***

Comment 13 CI Bug Log 2019-07-18 10:55:21 UTC

A CI Bug Log filter associated to this bug has been updated:

{- GUC: suspend tests / igt@i915_selftest@live_* - dmesg-warn - Unexpected event: no suitable handler -}
{+ ICL GUC: reload/suspend tests / igt@i915_selftest@live_* - dmesg-warn - Unexpected event: no suitable handler +}


  No new failures caught with the new filter

Comment 14 Lakshmi 2019-08-20 13:14:45 UTC

Fernando, this is an highest priority bug which needs an update everyday (SLA for highest priority bug is 1 day). 

Any updates here? Do you agree with the priority of this issue? What are next steps?

Comment 15 Fernando Pacheco 2019-08-20 15:30:00 UTC

Sorry for the slowness in updating. Patch to address this is currently under review: https://patchwork.freedesktop.org/patch/324196/

Comment 16 Jon Ewins 2019-08-29 14:34:19 UTC

Updated patchset in review: https://patchwork.freedesktop.org/series/65901/

Comment 17 Chris Wilson 2019-08-29 18:49:57 UTC

commit 9be02fde93e5b564663f10a48f19159fba718e3d (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Fernando Pacheco <fernando.pacheco@intel.com>
Date:   Thu Aug 29 10:41:53 2019 -0700

    drm/i915/uc: Extract common code from GuC stop/disable comm
    
    During normal driver unload we attempt to disable GuC communication
    while it is currently stopped. This results in a nop'd call to
    intel_guc_ct_disable within guc_disable_communication because
    stop/disable rely on the same flag to prevent further comms with CT.
    
    We can avoid the call to disable and still leave communication in a
    satisfactory state by extracting a set of shared steps from stop/disable.
    This set can include guc_disable_interrupts as we do not require the
    single caller of guc_stop_communication to be atomic:
    "drm/i915/selftests: Fixup atomic reset checking".
    
    This situation (stop -> disable) only occurs during intel_uc_fini_hw,
    so during fini, call guc_disable_communication only if currently enabled.
    The symmetric calls to enable/disable remain unmodified for all other
    scenarios.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110943
    Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190829174154.14675-1-fernando.pacheco@intel.com

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.