Bug 104786 - [CI] [KBL-only] igt@drv_selftest@live_hangcheck - dmesg-fail - kernel stack overflow
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 104594
Depends on:
Blocks:
 
Reported: 2018-01-25 12:17 UTC by Martin Peres
Modified: 2018-04-20 11:02 UTC

See Also:
i915 platform: KBL
i915 features: GEM/Other


Description Martin Peres 2018-01-25 12:17:18 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3683/shard-kbl4/igt@drv_selftest@live_hangcheck.html

[  200.430432] BUG: stack guard page was hit at 00000000a99d6f9e (stack is 00000000433da8e3..00000000af5c7482)
[  200.430438] kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP PTI
Comment 1 Chris Wilson 2018-01-25 12:18:55 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=104262#c8
> With a kasan run to investigate the stack page overflow:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_1702/shard-kbl6/igt@drv_selftest@live_hangcheck.html
> 
> it passed. Note that one thing that kasan does is disable CONFIG_VMAP_STACK,
> so changing the stack allocation/layout. Still not the positive lead I was
> hoping for.
Comment 2 Chris Wilson 2018-01-29 15:02:32 UTC
Mysterious https://intel-gfx-ci.01.org/CI/CI_DRM_3687/shard-kbl1/igt@drv_selftest@live_hangcheck.html

Let's see if it has magically resolved itself!
Comment 3 Chris Wilson 2018-01-29 15:19:43 UTC
(In reply to Chris Wilson from comment #2)
> Mysterious https://intel-gfx-ci.01.org/CI/CI_DRM_3687/shard-kbl1/igt@drv_selftest@live_hangcheck.html
> 
> Let's see if it has magically resolved itself!

Just a fluke.
Comment 4 Chris Wilson 2018-01-31 15:31:49 UTC
Infinite recursion:

[  202.346915] BUG: stack guard page was hit at 00000000fdec3e36 (stack is 00000000cfb340f3..00000000783f2a8f)
[  202.346915] kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP PTI
[  202.346915] Dumping ftrace buffer:
[  202.346915]    (ftrace buffer empty)
[  202.346916] Modules linked in: i915(+) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec x86_pkg_temp_thermal intel_powerclamp coretemp e1000e snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm mei_me mei ptp pps_core prime_numbers [last unloaded: i915]
[  202.346919] CPU: 0 PID: 5985 Comm: drv_selftest Tainted: G     U           4.15.0-CI-Trybot_1728+ #1
[  202.346919] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
[  202.346920] RIP: 0010:__lock_acquire+0x3b/0x1b60
[  202.346920] RSP: 0018:ffffc90000243f90 EFLAGS: 00010086
[  202.346920] RAX: 0000000000000000 RBX: 0000000000000086 RCX: 0000000000000000
[  202.346921] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82244958
[  202.346921] RBP: ffffc90000244050 R08: 0000000000000001 R09: 0000000000000001
[  202.346921] R10: 0000000000000000 R11: ffffffff810eae9e R12: 0000000000000000
[  202.346921] R13: ffff880273260040 R14: 0000000000000001 R15: 0000000000000000
[  202.346922] FS:  00007f654b75f8c0(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
[  202.346922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.346922] CR2: ffffc90000243f88 CR3: 000000025578e006 CR4: 00000000003606f0
[  202.346922] Call Trace:
[  202.346922]  ? lock_acquire+0xaf/0x200
[  202.346923]  lock_acquire+0xaf/0x200
[  202.346923]  ? vprintk_emit+0x6e/0x3b0
[  202.346923]  _raw_spin_lock+0x2a/0x40
[  202.346923]  ? vprintk_emit+0x6e/0x3b0
[  202.346923]  vprintk_emit+0x6e/0x3b0
[  202.346924]  dev_vprintk_emit+0x94/0x200
[  202.346924]  ? deactivate_slab.isra.23+0x856/0x880
[  202.346924]  dev_printk_emit+0x36/0x40
[  202.346924]  ? lock_acquire+0xaf/0x200
[  202.346924]  dev_notice+0x50/0x60
[  202.346925]  ? i915_gem_unset_wedged+0x151/0x180 [i915]
[  202.346925]  i915_reset+0x177/0x270 [i915]
[  202.346925]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346925]  i915_wait_request+0x77b/0x820 [i915]
[  202.346925]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346926]  ? wake_up_q+0x70/0x70
[  202.346926]  ? wake_up_q+0x70/0x70
[  202.346926]  wait_for_space+0x91/0x150 [i915]
[  202.346926]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346926]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346927]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346927]  i915_gem_reset+0x107/0x130 [i915]
[  202.346927]  i915_reset+0x207/0x270 [i915]
[  202.346927]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346928]  i915_wait_request+0x77b/0x820 [i915]
[  202.346928]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346928]  ? wake_up_q+0x70/0x70
[  202.346928]  ? wake_up_q+0x70/0x70
[  202.346928]  wait_for_space+0x91/0x150 [i915]
[  202.346929]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346929]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346929]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346929]  i915_gem_reset+0x107/0x130 [i915]
[  202.346929]  i915_reset+0x207/0x270 [i915]
[  202.346930]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346930]  i915_wait_request+0x77b/0x820 [i915]
[  202.346930]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346930]  ? wake_up_q+0x70/0x70
[  202.346930]  ? wake_up_q+0x70/0x70
[  202.346931]  wait_for_space+0x91/0x150 [i915]
[  202.346931]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346931]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346931]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346931]  i915_gem_reset+0x107/0x130 [i915]
[  202.346932]  i915_reset+0x207/0x270 [i915]
[  202.346932]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346932]  i915_wait_request+0x77b/0x820 [i915]
[  202.346932]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346932]  ? wake_up_q+0x70/0x70
[  202.346933]  ? wake_up_q+0x70/0x70
[  202.346933]  wait_for_space+0x91/0x150 [i915]
[  202.346933]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346933]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346933]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346934]  i915_gem_reset+0x107/0x130 [i915]
[  202.346934]  i915_reset+0x207/0x270 [i915]
[  202.346934]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346934]  i915_wait_request+0x77b/0x820 [i915]
[  202.346935]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346935]  ? wake_up_q+0x70/0x70
[  202.346935]  ? wake_up_q+0x70/0x70
[  202.346935]  wait_for_space+0x91/0x150 [i915]
[  202.346935]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346936]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346936]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346936]  i915_gem_reset+0x107/0x130 [i915]
[  202.346936]  i915_reset+0x207/0x270 [i915]
[  202.346936]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346937]  i915_wait_request+0x77b/0x820 [i915]
[  202.346937]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346937]  ? wake_up_q+0x70/0x70
[  202.346937]  ? wake_up_q+0x70/0x70
[  202.346937]  wait_for_space+0x91/0x150 [i915]
[  202.346938]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346938]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346938]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346938]  i915_gem_reset+0x107/0x130 [i915]
[  202.346938]  i915_reset+0x207/0x270 [i915]
[  202.346939]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346939]  i915_wait_request+0x77b/0x820 [i915]
[  202.346939]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346939]  ? wake_up_q+0x70/0x70
[  202.346939]  ? wake_up_q+0x70/0x70
[  202.346940]  wait_for_space+0x91/0x150 [i915]
[  202.346940]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346940]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346940]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346940]  i915_gem_reset+0x107/0x130 [i915]
[  202.346941]  i915_reset+0x207/0x270 [i915]
[  202.346941]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346941]  i915_wait_request+0x77b/0x820 [i915]
[  202.346941]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346942]  ? wake_up_q+0x70/0x70
[  202.346942]  ? wake_up_q+0x70/0x70
[  202.346942]  wait_for_space+0x91/0x150 [i915]
[  202.346942]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346942]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346943]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346943]  i915_gem_reset+0x107/0x130 [i915]
[  202.346943]  i915_reset+0x207/0x270 [i915]
[  202.346943]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346943]  i915_wait_request+0x77b/0x820 [i915]
[  202.346944]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346944]  ? wake_up_q+0x70/0x70
[  202.346944]  ? wake_up_q+0x70/0x70
[  202.346944]  wait_for_space+0x91/0x150 [i915]
[  202.346944]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346945]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346945]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346945]  i915_gem_reset+0x107/0x130 [i915]
[  202.346945]  i915_reset+0x207/0x270 [i915]
[  202.346945]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346946]  i915_wait_request+0x77b/0x820 [i915]
[  202.346946]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346946]  ? wake_up_q+0x70/0x70
[  202.346946]  ? wake_up_q+0x70/0x70
[  202.346946]  wait_for_space+0x91/0x150 [i915]
[  202.346947]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346947]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346947]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346947]  i915_gem_reset+0x107/0x130 [i915]
[  202.346947]  i915_reset+0x207/0x270 [i915]
[  202.346948]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346948]  i915_wait_request+0x77b/0x820 [i915]
[  202.346948]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346948]  ? wake_up_q+0x70/0x70
[  202.346948]  ? wake_up_q+0x70/0x70
[  202.346949]  wait_for_space+0x91/0x150 [i915]
[  202.346949]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346949]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346949]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346949]  i915_gem_reset+0x107/0x130 [i915]
[  202.346950]  i915_reset+0x207/0x270 [i915]
[  202.346950]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346950]  i915_wait_request+0x77b/0x820 [i915]
[  202.346950]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346950]  ? wake_up_q+0x70/0x70
[  202.346951]  ? wake_up_q+0x70/0x70
[  202.346951]  wait_for_space+0x91/0x150 [i915]
[  202.346951]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346951]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346951]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346952]  i915_gem_reset+0x107/0x130 [i915]
[  202.346952]  i915_reset+0x207/0x270 [i915]
[  202.346952]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346952]  i915_wait_request+0x77b/0x820 [i915]
[  202.346952]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346953]  ? wake_up_q+0x70/0x70
[  202.346953]  ? wake_up_q+0x70/0x70
[  202.346953]  wait_for_space+0x91/0x150 [i915]
[  202.346953]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346953]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346954]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346954]  i915_gem_reset+0x107/0x130 [i915]
[  202.346954]  i915_reset+0x207/0x270 [i915]
[  202.346954]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346955]  i915_wait_request+0x77b/0x820 [i915]
[  202.346955]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346955]  ? wake_up_q+0x70/0x70
[  202.346955]  ? wake_up_q+0x70/0x70
[  202.346955]  wait_for_space+0x91/0x150 [i915]
[  202.346956]  intel_ring_begin+0x113/0x1a0 [i915]
[  202.346956]  gen8_emit_flush_render+0x96/0x260 [i915]
[  202.346956]  i915_gem_request_alloc+0x2c8/0x5e0 [i915]
[  202.346956]  i915_gem_reset+0x107/0x130 [i915]
[  202.346956]  i915_reset+0x207/0x270 [i915]
[  202.346957]  __i915_wait_request_check_and_reset.isra.9.part.10+0x26/0x30 [i915]
[  202.346957]  i915_wait_request+0x77b/0x820 [i915]
[  202.346957]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[  202.346957]  ? wake_up_q+0x70/0x70
[  202.346957]  ? wake_up_q+0x70/0x70
[  202.346958]  wait_for_space+0
[  202.346958] Lost 238 message(s)!

Mainly due to the selftest design. Hmm.
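
The repetition in the trace above can be confirmed mechanically rather than by eye. The following is a minimal sketch, assuming dmesg lines in the format shown in this comment; the helper names `reliable_frames` and `find_cycle` are hypothetical and not part of any CI tooling.

```python
import re

# Matches a reliable frame such as "[  202.346925]  i915_reset+0x177/0x270 [i915]".
# Frames prefixed with "? " are unreliable guesses from the unwinder and are skipped,
# as are truncated frames that lack the "+0xoff/0xsize" suffix.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?!\?)([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(dmesg_lines):
    """Return the symbol names of the reliable call-trace frames, in order."""
    frames = []
    for line in dmesg_lines:
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    return frames


def find_cycle(frames, min_repeats=3):
    """Return (start, cycle) for the shortest run of frames that repeats
    back to back at least min_repeats times, or None if there is none."""
    n = len(frames)
    for period in range(1, n // min_repeats + 1):
        for start in range(0, n - period * min_repeats + 1):
            cycle = frames[start:start + period]
            if all(
                frames[start + k * period:start + (k + 1) * period] == cycle
                for k in range(1, min_repeats)
            ):
                return start, cycle
    return None
```

Feeding the trace from this comment through `reliable_frames` and then `find_cycle` should report a repeating group of i915 functions, which is the signature of recursion rather than a single deep call chain.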
Comment 5 Chris Wilson 2018-02-01 15:54:54 UTC
*** Bug 104594 has been marked as a duplicate of this bug. ***
Comment 6 Marta Löfstedt 2018-02-02 07:10:56 UTC
(In reply to Chris Wilson from comment #5)
> *** Bug 104594 has been marked as a duplicate of this bug. ***

OK Chris.

Just a reminder that bug 104262 is the one tracked in cibuglog.
Comment 7 Chris Wilson 2018-02-02 09:01:31 UTC
After the stack overflow is fixed, we are still likely to see the bad read from the CSB register occur sporadically, so bug 104262 will have a long life yet.
Comment 8 Chris Wilson 2018-02-05 16:17:28 UTC
commit 01b8fdc5222007bdfc905941173f82576898a7f7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 5 15:24:31 2018 +0000

    drm/i915: Skip post-reset request emission if the engine is not idle
    
    Since commit 7b6da818d86f ("drm/i915: Restore the kernel context after a
    GPU reset on an idle engine") we submit a request following the engine
    reset. The intent is that we don't submit a request if the engine is
    busy (as it will restart active by itself) but we only checked to see if
    there were remaining requests in flight on the hardware and skipped
    checking to see if there were any ready requests that would be
    immediately submitted on restart (the same time as our new request would
    be). Having convinced the engine to appear idle in the previous patch,
    we can use intel_engine_is_idle() as a better test to only submit a new
    request if there are no pending requests.
    
    As it happens, this is tripping up igt/drv_selftest/live_hangcheck in CI
    as we overfill the kernel_context ringbuffer and trigger an infinite
    recursion from within the reset.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104786
    References: 7b6da818d86f ("drm/i915: Restore the kernel context after a GPU reset on an idle engine")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-4-chris@chris-wilson.co.uk
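
The recursion described in the commit message, and the effect of the stricter idle check, can be illustrated with a toy model. This is a sketch under stated assumptions, not the i915 code: `ToyEngine`, its fields, and the depth limit are all hypothetical. Only the call order (reset -> request emission -> wait for ring space -> reset) is taken from the trace in comment 4.

```python
class StackOverflow(Exception):
    """Stands in for the kernel hitting its stack guard page."""


class ToyEngine:
    MAX_DEPTH = 50  # hypothetical stand-in for the finite kernel stack

    def __init__(self, ring_space, ready_requests, check_idle):
        self.ring_space = ring_space          # free slots in the kernel_context ring
        self.ready_requests = ready_requests  # queued, not yet on the hardware
        self.in_flight = 0                    # requests executing on the hardware
        self.check_idle = check_idle          # True models the behaviour after the fix
        self.depth = 0
        self.resets = 0

    def is_idle(self):
        # Idle only if nothing is executing and nothing is ready to be submitted.
        return self.in_flight == 0 and self.ready_requests == 0

    def reset(self):
        self.depth += 1
        if self.depth > self.MAX_DEPTH:
            raise StackOverflow("reset recursed %d times" % self.depth)
        self.resets += 1
        if self.check_idle:
            busy = not self.is_idle()
        else:
            # Older check in this model: only in-flight requests count as busy.
            busy = self.in_flight > 0
        if not busy:
            self.emit_request()
        self.depth -= 1

    def emit_request(self):
        if self.ring_space == 0:
            # No room in the ring: waiting for space ends up in reset() again.
            self.reset()
            return
        self.ring_space -= 1
```

In this model, an engine with a full ring and ready requests raises `StackOverflow` when `check_idle` is false, while the same engine with `check_idle` set to true performs a single reset and emits nothing.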
Comment 9 Jani Saarinen 2018-04-20 11:02:56 UTC
Closing; please re-open if it still occurs.

