<4> [46.791873] ------------[ cut here ]------------
<4> [46.791881] Unexpected send: action=0x30
<4> [46.791917] WARNING: CPU: 6 PID: 397 at drivers/gpu/drm/i915/intel_guc.c:373 intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.791918] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal snd_hda_intel coretemp snd_hda_codec snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm e1000e mei_me mei i2c_i801 prime_numbers
<4> [46.791927] CPU: 6 PID: 397 Comm: kworker/u17:0 Tainted: G U 5.0.0-rc4-gfd25766ba02f-drmtip_207+ #1
<4> [46.791928] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4> [46.791950] Workqueue: i915-guc_log capture_logs_work [i915]
<4> [46.791972] RIP: 0010:intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.791974] Code: 01 00 00 00 be c8 c4 00 00 e9 8e 16 95 e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 8b 36 48 c7 c7 82 0c 39 c0 e8 a2 22 e0 e8 <0f> 0b b8 ed ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c7 f8
<4> [46.791975] RSP: 0018:ffffa390003dfdb8 EFLAGS: 00010286
<4> [46.791976] RAX: 0000000000000000 RBX: 00000000a0600001 RCX: 0000000000000000
<4> [46.791977] RDX: 0000000000000007 RSI: ffffffffaa0878eb RDI: 00000000ffffffff
<4> [46.791979] RBP: 0000000000013000 R08: 000000000d47896a R09: 0000000000000000
<4> [46.791980] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002000
<4> [46.791981] R13: ffffa390008d1060 R14: ffff93f0cee11380 R15: ffff93f0cee10000
<4> [46.791982] FS: 0000000000000000(0000) GS:ffff93f0eeb80000(0000) knlGS:0000000000000000
<4> [46.791983] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [46.791984] CR2: 00007f64864fb788 CR3: 00000000b0214002 CR4: 00000000003606e0
<4> [46.791985] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [46.791986] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [46.791987] Call Trace:
<4> [46.792009] guc_log_capture_logs+0x2d9/0x460 [i915]
<4> [46.792016] process_one_work+0x245/0x610
<4> [46.792020] worker_thread+0x1d0/0x380
<4> [46.792022] ? process_one_work+0x610/0x610
<4> [46.792024] kthread+0x119/0x130
<4> [46.792026] ? kthread_park+0x80/0x80
<4> [46.792029] ret_from_fork+0x3a/0x50
<4> [46.792034] irq event stamp: 98
<4> [46.792037] hardirqs last enabled at (97): [<ffffffffa9126f0e>] console_unlock+0x3fe/0x600
<4> [46.792038] hardirqs last disabled at (98): [<ffffffffa90019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [46.792040] softirqs last enabled at (84): [<ffffffffa9c0033a>] __do_softirq+0x33a/0x4b9
<4> [46.792042] softirqs last disabled at (73): [<ffffffffa90b9a41>] irq_exit+0xd1/0xe0
<4> [46.792063] WARNING: CPU: 6 PID: 397 at drivers/gpu/drm/i915/intel_guc.c:373 intel_guc_send_nop+0xe/0x20 [i915]
<4> [46.792064] ---[ end trace 7c1cf8091b0b22a9 ]---
*** Bug 109537 has been marked as a duplicate of this bug. ***
The guc is in a reset state (from an earlier failure leaving the driver wedged) and not able to respond. Either this should not be a warn, or the logging should be more intelligent about when guc is disabled.
Worth a shot to see if this is fixed by
Author: Chris Wilson <firstname.lastname@example.org>
Date: Thu Feb 21 16:38:33 2019 +0000
drm/i915/guc: Flush the residual log capture irq on disabling
As we disable the log capture events, flush any residual interrupt
before we flush and disable the worker.
v2: Mika pointed out that it wasn't the worker re-queueing itself, but a
Signed-off-by: Chris Wilson <email@example.com>
Cc: Mika Kuoppala <firstname.lastname@example.org>
Reviewed-by: Mika Kuoppala <email@example.com>
(In reply to Chris Wilson from comment #3)
> Worth a shot to see if this is fixed by
> commit 07c100b187332101220baf7446b4f09296d7c59b
> Author: Chris Wilson <firstname.lastname@example.org>
> Date: Thu Feb 21 16:38:33 2019 +0000
> drm/i915/guc: Flush the residual log capture irq on disabling
> As we disable the log capture events, flush any residual interrupt
> before we flush and disable the worker.
> v2: Mika pointed out that it wasn't the worker re-queueing itself, but a
> rogue irq.
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109716
> Signed-off-by: Chris Wilson <email@example.com>
> Cc: Mika Kuoppala <firstname.lastname@example.org>
> Reviewed-by: Mika Kuoppala <email@example.com>
Seems like it might have reduced the reproduction rate, but unfortunately, this is still happening: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_234/fi-kbl-guc/igt@firstname.lastname@example.org
The issue with the guc logging awareness of GuC state will still need to be fixed for when guc relay is in use, but as a first step the intention is to stop this issue occurring in BAT tests.
The relay log is a developer only feature and an extension of the standard GuC logging mechanism. Its testing should not be part of the BAT set. It is being implicitly enabled by existing tests, which provide no consumer for the logs.
The issue is that IGT tests that cycle through reading the debugfs entries will call the open file op for the guc_log_relay control file and this currently causes the file to be both created and the logging to be started. These tests have no consumer for the logs which go on to overflow, run out of sub-buffers or, as in this specific case, misbehave in suspend and reset handling.
A proposed change, that will follow as RFC patches, is to separate the creation and logging start, updating the guc_log_relay_write function, which currently just flushes the log, to also support starting the logging based on the value written.
An additional new test to actually explicitly test this relay log developer feature will follow separately.
Same issue as 111148 and 111165. Proposed fix tested and passing. Patches in prep for posting.
RFC patch sent out for review: drm/i915/guc: Enable guc logging on guc log relay write.
While creating and testing the patch also noticed that the IGT tool intel_guc_logger needs to be updated for the new relay implementation. Will create another patch to update the IGT tool.