Bug 111256

Summary:	[CI][SHARDS] igt@prime_busy@wait-hang-vebox\|igt@gem_eio@in-flight-suspend - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!
Product:	DRI	Reporter:	Lakshmi <lakshminarayana.vudum>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	low	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	GLK, SNB	i915 features:	GEM/Other

Description Lakshmi 2019-07-30 07:17:59 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6576/shard-glk7/igt@prime_busy@wait-hang-vebox.html

<4> [163.610318] ------------[ cut here ]------------
<2> [163.610336] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!
<4> [163.610405] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [163.610415] CPU: 1 PID: 30 Comm: kworker/u8:1 Tainted: G     U            5.3.0-rc2-CI-CI_DRM_6576+ #1
<4> [163.610429] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [163.610531] Workqueue: i915 retire_work_handler [i915]
<4> [163.610595] RIP: 0010:intel_engine_pm_put+0x44/0x50 [i915]
<4> [163.610604] Code: 00 00 00 48 89 df e8 db cd 00 e1 85 c0 75 03 5b 5d c3 48 8d bd c8 bb 00 00 48 89 de 48 c7 c2 50 47 10 a0 5b 5d e9 6c 9b fe ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 48 81 c7 b8 00 00 00 48 c7 c6
<4> [163.610629] RSP: 0018:ffffc9000013bdb8 EFLAGS: 00010246
<4> [163.610638] RAX: 0000000000000000 RBX: ffff88825d8bf700 RCX: 0000000000000001
<4> [163.610648] RDX: 00000000000015b3 RSI: ffff888276083018 RDI: ffff888261bb42a8
<4> [163.610659] RBP: ffff888261ba0000 R08: ffff888276083018 R09: 0000000000000000
<4> [163.610669] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88825d8bf998
<4> [163.610680] R13: ffff88825d8bf990 R14: ffff88825d8bf760 R15: 0000000000000000
<4> [163.610690] FS:  0000000000000000(0000) GS:ffff888277e80000(0000) knlGS:0000000000000000
<4> [163.610703] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [163.610712] CR2: 00007f88a23e6000 CR3: 0000000272460000 CR4: 0000000000340ee0
<4> [163.610722] Call Trace:
<4> [163.610788]  i915_request_retire+0x35e/0x840 [i915]
<4> [163.610859]  ring_retire_requests+0x47/0x50 [i915]
<4> [163.610927]  i915_retire_requests+0x57/0xc0 [i915]
<4> [163.610993]  retire_work_handler+0x27/0x60 [i915]
<4> [163.611006]  process_one_work+0x245/0x610
<4> [163.611016]  worker_thread+0x1d0/0x380
<4> [163.611025]  ? process_one_work+0x610/0x610
<4> [163.611034]  kthread+0x119/0x130
<4> [163.611042]  ? kthread_park+0xa0/0xa0
<4> [163.611053]  ret_from_fork+0x24/0x50
<4> [163.611064] Modules linked in: vgem mei_hdcp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl btbcm btintel x86_pkg_temp_thermal coretemp bluetooth crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ecdh_generic ecc r8169 i915 realtek i2c_hid snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei pinctrl_geminilake prime_numbers pinctrl_intel
<0> [163.611125] Dumping ftrace buffer:
<0> [163.611132] ---------------------------------
<0> [163.611206] CPU:1 [LOST 573578 EVENTS]
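For triage, the BUG site can be pulled out of a captured log like the one above mechanically. The following is a minimal sketch, assuming the `<level> [timestamp] message` line layout seen in this report; the function name `find_kernel_bugs` and both regular expressions are illustrative and not part of any CI tooling.

```python
import re

# Assumed layout, taken from the log in this report: "<level> [timestamp] message".
LINE_RE = re.compile(r"^<(?P<level>\d)>\s+\[\s*(?P<ts>\d+\.\d+)\]\s(?P<msg>.*)$")
# Matches the "kernel BUG at <path>:<line>!" banner printed by BUG().
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")


def find_kernel_bugs(dmesg_text):
    """Return (timestamp, source path, source line) for each BUG banner found."""
    hits = []
    for raw in dmesg_text.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        bug = BUG_RE.search(m.group("msg"))
        if bug:
            hits.append((float(m.group("ts")), bug.group("path"), int(bug.group("line"))))
    return hits


SAMPLE = """<4> [163.610318] ------------[ cut here ]------------
<2> [163.610336] kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!
<4> [163.610405] invalid opcode: 0000 [#1] PREEMPT SMP PTI"""

if __name__ == "__main__":
    for ts, path, line in find_kernel_bugs(SAMPLE):
        print(f"{ts:.6f} {path}:{line}")
```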

Comment 1 CI Bug Log 2019-07-30 07:19:17 UTC

The CI Bug Log issue associated with this bug has been updated.

### New filters associated

* GLK: igt@prime_busy@wait-hang-vebox - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110! 
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6576/shard-glk7/igt@prime_busy@wait-hang-vebox.html

Comment 2 Chris Wilson 2019-07-30 11:33:23 UTC

Looks very innocuous. There's at least one change pending to i915_active that will affect this, so watch this space?

Comment 3 Francesco Balestrieri 2019-07-31 11:42:52 UTC

Setting priority to low based on Chris' comment.

Comment 4 CI Bug Log 2019-08-02 06:58:37 UTC

A CI Bug Log filter associated with this bug has been updated:

{- GLK: igt@prime_busy@wait-hang-vebox - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!  -}
{+ SNB GLK: igt@prime_busy@wait-hang-vebox - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!  +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6601/shard-snb6/igt@gem_eio@in-flight-suspend.html

Comment 5 CI Bug Log 2019-08-02 06:58:59 UTC

A CI Bug Log filter associated with this bug has been updated:

{- SNB GLK: igt@prime_busy@wait-hang-vebox - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!  -}
{+ SNB GLK: igt@prime_busy@wait-hang-vebox|igt@gem_eio@in-flight-suspend - dmesg-warn - kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!  +}


  No new failures caught with the new filter
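The filter updates in comments 4 and 5 widen first the machine list and then the test pattern. A rough sketch of how such a filter could be evaluated follows, assuming the test field is a regular expression (the `|` alternation suggests this, but it is an assumption). The dictionary keys, the machine pattern, and the `matches` helper are hypothetical and do not reflect the actual CI Bug Log schema.

```python
import re

# Hypothetical model of the filter after comment 5. Field names are illustrative.
FILTER = {
    "machines": re.compile(r"shard-(snb|glk)\d*"),
    "tests": re.compile(r"igt@prime_busy@wait-hang-vebox|igt@gem_eio@in-flight-suspend"),
    "status": "dmesg-warn",
    "dmesg": re.compile(r"kernel BUG at \./drivers/gpu/drm/i915/intel_wakeref\.h:110!"),
}


def matches(result, flt=FILTER):
    """Return True when a CI result satisfies every field of the filter."""
    return (
        flt["machines"].fullmatch(result["machine"]) is not None
        and flt["tests"].fullmatch(result["test"]) is not None
        and result["status"] == flt["status"]
        and flt["dmesg"].search(result["dmesg"]) is not None
    )


BUG_LINE = "kernel BUG at ./drivers/gpu/drm/i915/intel_wakeref.h:110!"

glk_result = {
    "machine": "shard-glk7",
    "test": "igt@prime_busy@wait-hang-vebox",
    "status": "dmesg-warn",
    "dmesg": BUG_LINE,
}
snb_result = {
    "machine": "shard-snb6",
    "test": "igt@gem_eio@in-flight-suspend",
    "status": "dmesg-warn",
    "dmesg": BUG_LINE,
}
```

Under this model both failures reported in this bug match the widened filter, while the same dmesg line from an unlisted test would not.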

Comment 6 Chris Wilson 2019-08-02 11:28:16 UTC

No causal link, but 

commit d8af05ff38ae7a42819b285ffef314942414ef8b (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 11:00:15 2019 +0100

    drm/i915: Allow sharing the idle-barrier from other kernel requests
    
    By placing our idle-barriers in the i915_active fence tree, we expose
    those for reuse by other components that are issuing requests along the
    kernel_context. Reusing the proto-barrier active_node is perfectly fine
    as the new request implies a context-switch, and so an opportune point
    to run the idle-barrier. However, the proto-barrier is not equivalent
    to a normal active_node and care must be taken to avoid dereferencing the
    ERR_PTR used as its request marker.

is likely related. Watch this space.

Comment 7 Chris Wilson 2019-08-08 20:58:56 UTC

commit c7302f204490f3eb4ef839bec228315bcd3ba43f (drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 8 21:27:58 2019 +0100

    drm/i915: Defer final intel_wakeref_put to process context
    
    As we need to acquire a mutex to serialise the final
    intel_wakeref_put, we need to ensure that we are in process context at
    that time. However, we want to allow operation on the intel_wakeref from
    inside timer and other hardirq context, which means that we need to defer
    that final put to a workqueue.
    
    Inside the final wakeref puts, we are safe to operate in any context, as
    we are simply marking up the HW and state tracking for the potential
    sleep. It's only the serialisation with the potential sleeping getting
    that requires careful wait avoidance. This allows us to retain the
    immediate processing as before (we only need to sleep over the same
    races as the current mutex_lock).
    
    v2: Add a selftest to ensure we exercise the code while lockdep watches.
    v3: That test was extremely loud and complained about many things!
    v4: Not a whale!
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111295
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111245
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111256
    Fixes: 18398904ca9e ("drm/i915: Only recover active engines")
    Fixes: 51fbd8de87dc ("drm/i915/pmu: Atomically acquire the gt_pm wakeref")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190808202758.10453-1-chris@chris-wilson.co.uk

Probably.
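The idea in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the i915 implementation: the `Wakeref` class, the `may_sleep` flag, and `flush_work` are hypothetical names, a `deque` stands in for a kernel workqueue, and a `threading.Lock` stands in for the sleeping mutex. Non-final puts drop the count without the mutex; the final put takes the mutex directly when the caller may sleep and is otherwise queued for process context.

```python
import threading
from collections import deque


class Wakeref:
    """Toy model: the final put needs a mutex, so atomic callers defer it."""

    def __init__(self, park):
        self._count = 0
        self._count_lock = threading.Lock()  # models atomic ops on the counter
        self._mutex = threading.Lock()       # models the sleeping mutex
        self._park = park                    # callback run when the count hits zero
        self._work = deque()                 # models a workqueue

    @property
    def count(self):
        with self._count_lock:
            return self._count

    def get(self):
        with self._mutex:
            with self._count_lock:
                self._count += 1

    def put(self, may_sleep=True):
        with self._count_lock:
            if self._count > 1:
                self._count -= 1  # not the last reference: no mutex needed
                return
        if may_sleep:
            self._final_put()
        else:
            # Caller models timer/hardirq context: defer to process context.
            self._work.append(self._final_put)

    def _final_put(self):
        with self._mutex:
            with self._count_lock:
                self._count -= 1
                idle = self._count == 0
            if idle:
                self._park()

    def flush_work(self):
        """Run deferred final puts, as a worker thread would."""
        while self._work:
            self._work.popleft()()
```

In this model a final `put(may_sleep=False)` leaves the count at 1 until `flush_work()` runs, at which point the park callback fires.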

Comment 8 Lakshmi 2019-11-29 13:38:03 UTC

The reproduction rate of this issue is once in 8.3 runs. It was last seen in CI_DRM_6601_full (3 months, 4 weeks ago), and the current run is 7435. Archiving this bug.

Comment 9 CI Bug Log 2019-11-29 13:38:09 UTC

The CI Bug Log issue associated with this bug has been archived.

New failures matching the above filters will no longer be associated with this bug.
