Bug 109536 - [CI][DRMTIP] igt@i915_suspend@debugfs-reader - dmesg-warn - Unexpected send: action=0x30
Summary: [CI][DRMTIP] igt@i915_suspend@debugfs-reader - dmesg-warn - Unexpected send: ...
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: medium normal
Assignee: Robert M. Fosha
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
: 109537 (view as bug list)
Depends on:
Blocks:
 
Reported: 2019-02-03 12:41 UTC by Martin Peres
Modified: 2019-09-10 23:08 UTC (History)
2 users (show)

See Also:
i915 platform: KBL
i915 features: firmware/guc


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Peres 2019-02-03 12:41:38 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_207/fi-kbl-guc/igt@i915_suspend@debugfs-reader.html

<4> [46.791873] ------------[ cut here ]------------
<4> [46.791881] Unexpected send: action=0x30
<4> [46.791917] WARNING: CPU: 6 PID: 397 at drivers/gpu/drm/i915/intel_guc.c:373 intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.791918] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal snd_hda_intel coretemp snd_hda_codec snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm e1000e mei_me mei i2c_i801 prime_numbers
<4> [46.791927] CPU: 6 PID: 397 Comm: kworker/u17:0 Tainted: G     U            5.0.0-rc4-gfd25766ba02f-drmtip_207+ #1
<4> [46.791928] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4> [46.791950] Workqueue: i915-guc_log capture_logs_work [i915]
<4> [46.791972] RIP: 0010:intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.791974] Code: 01 00 00 00 be c8 c4 00 00 e9 8e 16 95 e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 8b 36 48 c7 c7 82 0c 39 c0 e8 a2 22 e0 e8 <0f> 0b b8 ed ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c7 f8
<4> [46.791975] RSP: 0018:ffffa390003dfdb8 EFLAGS: 00010286
<4> [46.791976] RAX: 0000000000000000 RBX: 00000000a0600001 RCX: 0000000000000000
<4> [46.791977] RDX: 0000000000000007 RSI: ffffffffaa0878eb RDI: 00000000ffffffff
<4> [46.791979] RBP: 0000000000013000 R08: 000000000d47896a R09: 0000000000000000
<4> [46.791980] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002000
<4> [46.791981] R13: ffffa390008d1060 R14: ffff93f0cee11380 R15: ffff93f0cee10000
<4> [46.791982] FS:  0000000000000000(0000) GS:ffff93f0eeb80000(0000) knlGS:0000000000000000
<4> [46.791983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [46.791984] CR2: 00007f64864fb788 CR3: 00000000b0214002 CR4: 00000000003606e0
<4> [46.791985] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [46.791986] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [46.791987] Call Trace:
<4> [46.792009]  guc_log_capture_logs+0x2d9/0x460 [i915]
<4> [46.792016]  process_one_work+0x245/0x610
<4> [46.792020]  worker_thread+0x1d0/0x380
<4> [46.792022]  ? process_one_work+0x610/0x610
<4> [46.792024]  kthread+0x119/0x130
<4> [46.792026]  ? kthread_park+0x80/0x80
<4> [46.792029]  ret_from_fork+0x3a/0x50
<4> [46.792034] irq event stamp: 98
<4> [46.792037] hardirqs last  enabled at (97): [<ffffffffa9126f0e>] console_unlock+0x3fe/0x600
<4> [46.792038] hardirqs last disabled at (98): [<ffffffffa90019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [46.792040] softirqs last  enabled at (84): [<ffffffffa9c0033a>] __do_softirq+0x33a/0x4b9
<4> [46.792042] softirqs last disabled at (73): [<ffffffffa90b9a41>] irq_exit+0xd1/0xe0
<4> [46.792063] WARNING: CPU: 6 PID: 397 at drivers/gpu/drm/i915/intel_guc.c:373 intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.792064] ---[ end trace 7c1cf8091b0b22a9 ]---
Comment 1 Chris Wilson 2019-02-03 16:36:17 UTC
*** Bug 109537 has been marked as a duplicate of this bug. ***
Comment 2 Chris Wilson 2019-02-04 08:54:10 UTC
The guc is in a reset state (from an earlier failure leaving the driver wedged) and not able to respond. Either this should not be a warn, or the logging should be more intelligent about when guc is disabled.
Comment 3 Chris Wilson 2019-02-26 10:45:50 UTC
Worth a shot to see if this is fixed by

commit 07c100b187332101220baf7446b4f09296d7c59b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Feb 21 16:38:33 2019 +0000

    drm/i915/guc: Flush the residual log capture irq on disabling
    
    As we disable the log capture events, flush any residual interrupt
    before we flush and disable the worker.
    
    v2: Mika pointed out that it wasn't the worker re-queueing itself, but a
    rogue irq.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109716
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190221163833.21393-1-chris@chris-wilson.co.uk
Comment 4 Martin Peres 2019-03-06 17:27:45 UTC
(In reply to Chris Wilson from comment #3)
> Worth a shot to see if this is fixed by
> 
> commit 07c100b187332101220baf7446b4f09296d7c59b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Feb 21 16:38:33 2019 +0000
> 
>     drm/i915/guc: Flush the residual log capture irq on disabling
>     
>     As we disable the log capture events, flush any residual interrupt
>     before we flush and disable the worker.
>     
>     v2: Mika pointed out that it wasn't the worker re-queueing itself, but a
>     rogue irq.
>     
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109716
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20190221163833.21393-1-
> chris@chris-wilson.co.uk

Seems like it might have reduced the reproduction rate, but unfortunately, this is still happening: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_234/fi-kbl-guc/igt@i915_suspend@debugfs-reader.html
Comment 5 Jon Ewins 2019-08-21 01:23:47 UTC
The issue with the guc logging awareness of GuC state will still need to be fixed for when guc relay is in use, but as a first step the intention is to stop this issue occurring in BAT tests.
The relay log is a developer only feature and an extension of the standard GuC logging mechanism.  Its testing should not be part of the BAT set.  It is being implicitly enabled by existing tests, which provide no consumer for the logs.  
The issue is that IGT tests that cycle through reading the debugfs entries will call the open file op for the guc_log_relay control file and this currently causes the file to be both created and the logging to be started.  These tests have no consumer for the logs which go on to overflow, run out of sub-buffers or, as in this specific case, misbehave in suspend and reset handling.
A proposed change, that will follow as RFC patches, is to separate the creation and logging start, updating the guc_log_relay_write function, which currently just flushes the log, to also support starting the logging based on the value written.
An additional new test to actually explicitly test this relay log developer feature will follow separately.
Comment 6 Jon Ewins 2019-09-10 01:16:08 UTC
Same issue as 111148 and 111165. Proposed fix tested and passing.  Patches in prep for posting.
Comment 7 Robert M. Fosha 2019-09-10 23:08:31 UTC
RFC patch sent out for review: drm/i915/guc: Enable guc logging on guc log relay write.

While creating and testing the patch also noticed that the IGT tool intel_guc_logger needs to be updated for the new relay implementation. Will create another patch to update the IGT tool.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.