Bug 111732

Summary: [CI][SHARDS] igt@perf@gen8-unprivileged-single-ctx-counters|igt@gem_softpin@softpin| igt@gem_pipe_control_store_loop@reused-buffer - timeout - Received signal SIGQUIT.
Product: DRI Reporter: Lakshmi <lakshminarayana.vudum>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: not set    
Priority: not set CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: BXT, GLK, ICL, KBL, SKL i915 features:

Description Lakshmi 2019-09-18 13:31:22 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5188/shard-skl1/igt@perf@gen8-unprivileged-single-ctx-counters.html
Starting subtest: gen8-unprivileged-single-ctx-counters
Received signal SIGQUIT.
Stack trace: 
 #0 [fatal_sig_handler+0xd6]
 #1 [killpg+0x40]
 #2 [waitpid+0x12]
 #3 [__waitpid+0x46]
 #4 [igt_wait_helper+0x11]
 #5 [gen8_test_single_ctx_render_target_writes_a_counter+0x168]
 #6 [__real_main4067+0x160e]
 #7 [main+0x27]
 #8 [__libc_start_main+0xe7]
 #9 [_start+0x2a]
Comment 1 CI Bug Log 2019-09-18 13:32:12 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: igt@perf@gen8-unprivileged-single-ctx-counters - timeout - Received signal SIGQUIT.
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5188/shard-skl1/igt@perf@gen8-unprivileged-single-ctx-counters.html
Comment 2 CI Bug Log 2019-10-01 10:49:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL: igt@perf@gen8-unprivileged-single-ctx-counters - timeout - Received signal SIGQUIT. -}
{+ SKL: igt@perf@gen8-unprivileged-single-ctx-counters| igt@gem_softpin@softpin| igt@gem_pipe_control_store_loop@reused-buffer - timeout - Received signal SIGQUIT. +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6977/shard-skl7/igt@gem_pipe_control_store_loop@reused-buffer.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6978/shard-skl6/igt@gem_softpin@softpin.html
Comment 3 CI Bug Log 2019-10-11 12:51:07 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL: igt@perf@gen8-unprivileged-single-ctx-counters| igt@gem_softpin@softpin| igt@gem_pipe_control_store_loop@reused-buffer - timeout - Received signal SIGQUIT. -}
{+ SKL: random tests - timeout - Received signal SIGQUIT. +}


  No new failures caught with the new filter
Comment 4 Vanshidhar Konda 2019-10-29 18:49:02 UTC
This issue is being observed 1-2 times every week since Sep 17, 2019 on different SKL shard machines. From the most recent log it seems like there is a deadlock between some of the threads.
Comment 5 Chris Wilson 2019-11-16 09:13:16 UTC
Showing all locks held in the system:
<4> [555.972811] 1 lock held by khungtaskd/34:
<4> [555.972830]  #0: ffffffff82247a60 (rcu_read_lock){....}, at: debug_show_all_locks+0xe/0x1a0
<4> [555.972900] 3 locks held by kworker/3:2/160:
<4> [555.972919]  #0: ffff88817a08f528 ((wq_completion)events){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.972972]  #1: ffffc90000f17e78 ((work_completion)(&(&engine->heartbeat.work)->work)){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.973024]  #2: ffff888172908e70 (&kernel#2){+.+.}, at: heartbeat+0x3f7/0x5f0 [i915]
<4> [555.973427] 4 locks held by kworker/3:3/392:
<4> [555.973445]  #0: ffff88817a08f528 ((wq_completion)events){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.973497]  #1: ffffc9000034be78 ((work_completion)(&(&gt->requests.retire_work)->work)){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.973547]  #2: ffff88817993ac70 (&timeline->mutex){+.+.}, at: intel_gt_retire_requests_timeout+0xe2/0x540 [i915]
<4> [555.973857]  #3: ffff88816e17d928 (&ce->pin_mutex/1){+.+.}, at: intel_context_unpin+0x54/0x100 [i915]
<4> [555.974242] 4 locks held by in:imklog/403:
<4> [555.974262] 4 locks held by rs:main Q:Reg/404:
<4> [555.974288] 4 locks held by dmesg/963:
<4> [555.974308] 3 locks held by kworker/1:1H/1084:
<4> [555.974324]  #0: ffff888179630f28 ((wq_completion)kblockd){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.974368]  #1: ffffc9000070be78 ((work_completion)(&(&hctx->run_work)->work)){+.+.}, at: process_one_work+0x1de/0x620
<4> [555.974415]  #2: ffff888176da6e58 (hctx->srcu){....}, at: hctx_lock+0x5b/0xe0
<4> [555.974480] 5 locks held by perf/1180:
<4> [555.974499]  #0: ffff88815810ba80 (&perf->lock){+.+.}, at: i915_perf_open_ioctl+0x242/0x6f0 [i915]
<4> [555.974873]  #1: ffff888170b93478 (&ctx->engines_mutex){+.+.}, at: lrc_configure_all_contexts+0x26b/0x880 [i915]
<4> [555.975330]  #2: ffff88816e17d928 (&ce->pin_mutex){+.+.}, at: lrc_configure_all_contexts+0x2d7/0x880 [i915]
<4> [555.975693]  #3: ffff888172908e70 (&kernel#2){+.+.}, at: i915_request_create+0x16/0x1b0 [i915]
<4> [555.976136]  #4: ffff88817993ac70 (&timeline->mutex/1){+.+.}, at: intel_context_prepare_remote_request+0x3f/0x180 [i915]
<4> [555.976485] 


Looks like lock inversion between the loop to update all contexts and timeline retiring.
Comment 6 Chris Wilson 2019-11-22 00:34:07 UTC
*** Bug 112364 has been marked as a duplicate of this bug. ***
Comment 8 CI Bug Log 2019-11-26 08:11:49 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BLB GLK: Random tests - timeout - Showing all locks held in the system -}
{+ BLB APL SKL KBL GLK ICL: Random tests - timeout - Received signal SIGQUIT, Showing all locks held in the system +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7417/shard-iclb5/igt@perf@oa-exponents.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7418/shard-iclb1/igt@perf@oa-exponents.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7419/shard-apl7/igt@perf@oa-exponents.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7419/shard-kbl7/igt@perf@oa-exponents.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7419/shard-skl8/igt@perf@oa-exponents.html
Comment 10 Chris Wilson 2019-11-26 14:50:22 UTC
commit 58b4c1a07ada7fb91ea757bdb9bd47df02207357 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 26 06:55:21 2019 +0000

    drm/i915: Reduce nested prepare_remote_context() to a trylock
    
    On context retiring, we may invoke the kernel_context to unpin this
    context. Elsewhere, we may use the kernel_context to modify this
    context. This currently leads to an AB-BA lock inversion, so we need to
    back-off from the contended lock, and repeat.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111732
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Fixes: a9877da2d629 ("drm/i915/oa: Reconfigure contexts on the fly")
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191126065521.2331017-1-chris@chris-wilson.co.uk
Comment 11 CI Bug Log 2019-11-27 08:33:03 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BLB APL SKL KBL GLK ICL: Random tests - timeout - Received signal SIGQUIT, Showing all locks held in the system -}
{+ BLB APL SKL KBL GLK ICL: Random tests - timeout - Received signal SIGQUIT, Showing all locks held in the system +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7423/shard-skl10/igt@perf@enable-disable.html

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.