Bug 111639 - [CI][RESUME] igt@gem_vm_create@isolation - dmesg-warn - GEM_BUG_ON(!intel_context_is_pinned(ce))
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high not set
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-11 05:35 UTC by Martin Peres
Modified: 2019-10-16 10:47 UTC

See Also:
i915 platform: TGL
i915 features: GEM/Other


Description Martin Peres 2019-09-11 05:35:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-tgl-u/igt%40gem_vm_create%40isolation.html

<3> [62.231256] __execlists_reset:2398 GEM_BUG_ON(!intel_context_is_pinned(ce))
<4> [62.231438] ------------[ cut here ]------------
<2> [62.231444] kernel BUG at drivers/gpu/drm/i915/gt/intel_lrc.c:2398!
<4> [62.231500] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
<4> [62.231513] CPU: 0 PID: 1010 Comm: gem_vm_create Tainted: G     U            5.3.0-rc7-gedc92c969391-drmtip_364+ #1
<4> [62.231533] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.2321.A01.1908052106 08/05/2019
<4> [62.231640] RIP: 0010:__execlists_reset+0x768/0xb50 [i915]
<4> [62.231655] Code: a7 6c fd dc 48 8b 35 ff 00 24 00 49 c7 c0 d9 29 30 c0 b9 5e 09 00 00 48 c7 c2 90 26 2a c0 48 c7 c7 f3 ab 15 c0 e8 38 50 04 dd <0f> 0b 48 c7 c1 d0 93 2c c0 ba 7a 02 00 00 48 c7 c6 50 26 2a c0 48
<4> [62.231683] RSP: 0018:ffffabd78054bce0 EFLAGS: 00010082
<4> [62.231695] RAX: 000000000000000e RBX: ffff9fad95738008 RCX: 0000000000000000
<4> [62.231708] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000cb0
<4> [62.231725] RBP: ffffabd78054bd40 R08: 0000000000000000 R09: 0000000000000cb0
<4> [62.231738] R10: 00000000ab13c364 R11: ffff9fad9e897a38 R12: ffff9fad70b379c0
<4> [62.231751] R13: ffff9fad8c10be70 R14: ffff9fad70b379c0 R15: ffff9fad8db23140
<4> [62.231790] FS:  00007f41465ea300(0000) GS:ffff9fada0400000(0000) knlGS:0000000000000000
<4> [62.231801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [62.231809] CR2: 00007fff731baca8 CR3: 0000000462a2a002 CR4: 0000000000760ef0
<4> [62.231819] PKRU: 55555554
<4> [62.231825] Call Trace:
<4> [62.231883]  execlists_cancel_requests+0x57/0x3d0 [i915]
<4> [62.231940]  __intel_gt_set_wedged.part.9+0xb2/0x180 [i915]
<4> [62.231952]  ? __drm_printfn_info+0x20/0x20
<4> [62.232005]  intel_gt_set_wedged+0x64/0x70 [i915]
<4> [62.232055]  i915_drop_caches_set+0x151/0x2f0 [i915]
<4> [62.232067]  simple_attr_write+0xb0/0xd0
<4> [62.232077]  full_proxy_write+0x51/0x80
<4> [62.232087]  vfs_write+0xbd/0x1d0
<4> [62.232094]  ksys_write+0x8f/0xe0
<4> [62.232103]  do_syscall_64+0x55/0x1c0
<4> [62.232112]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [62.232120] RIP: 0033:0x7f4145d69281
<4> [62.232127] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
<4> [62.232149] RSP: 002b:00007fff731bda58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4> [62.232161] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4145d69281
<4> [62.232172] RDX: 0000000000000005 RSI: 00007fff731bdab0 RDI: 0000000000000007
<4> [62.232182] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
<4> [62.232192] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff731bdab0
<4> [62.232202] R13: 0000000000000007 R14: 00007fff731bdab0 R15: 00007f4145d53d80
<4> [62.232216] Modules linked in: vgem mei_hdcp i915 ax88179_178a x86_pkg_temp_thermal usbnet coretemp mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mei_me mei prime_numbers
Comment 1 CI Bug Log 2019-09-11 05:37:20 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: igt@gem_vm_create@isolation - dmesg-warn - GEM_BUG_ON(!intel_context_is_pinned(ce))
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_363/fi-tgl-u/igt@gem_vm_create@isolation.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-tgl-u/igt@gem_vm_create@isolation.html

* TGL: igt@runner@aborted - fail - Previous test: gem_vm_create (isolation)
  (No new failures associated)
Comment 2 Chris Wilson 2019-09-11 07:51:32 UTC
Hmm, it looks like we dropped a pin on the rcs0->kernel_context. Bad, very bad.
Comment 4 Chris Wilson 2019-09-18 07:29:36 UTC
(In reply to Sudeep Dutt from comment #3)
> Test is passing @
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_371/fi-tgl-u/igt%40gem_vm_create%40isolation.html

It's not a causal link; the bug is unrelated to the test. It just happened to show up as we cleaned up the TGL hangs. My presumption is that we hit an error path and tried to clean up a context twice (the most fragile link there is the active pin).
Comment 5 Chris Wilson 2019-09-23 15:25:20 UTC
commit ae911b23d2f06c5d0a3e32768bedea857cadd269
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 23 12:00:53 2019 +0100

    drm/i915/execlists: Relax assertion for a pinned context image on reset
    
    A gpu hang can occur at any time, given a sufficiently angry gpu. An
    example is when it forgets to perform a context-switch at the end of a
    request, leaving us with a hanging GPU on a completed request. Here, we
    may retire the request, only leaving its context alive via the active
    barrier. When we reset the GPU on a completed request, we do not modify
    its context image (just updating the ring state) and can safely defer
    the assertion that we have the image pinned and ready to modify.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111639
    Fixes: dffa8feb3084 ("drm/i915/perf: Assert locking for i915_init_oa_perf_state()")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190923110056.15176-1-chris@chris-wilson.co.uk
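
The reasoning in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the i915 code: every name here (toy_context, toy_request, toy_reset_strict, toy_reset_relaxed) is hypothetical, and a false return stands in for the point where the driver would have hit GEM_BUG_ON. It shows why checking the pin only when the context image is about to be modified lets a reset on a completed request proceed.

```c
#include <stdbool.h>

/* Toy stand-ins only; these are NOT the real i915 structures. */
struct toy_context {
    int pin_count;      /* > 0 while the context image is pinned */
    unsigned ring_head; /* ring state, updated on every reset */
    int image_writes;   /* counts modifications of the context image */
};

struct toy_request {
    struct toy_context *ctx;
    bool completed;     /* request had already finished before the hang */
    unsigned tail;
};

static bool toy_context_is_pinned(const struct toy_context *ce)
{
    return ce->pin_count > 0;
}

/*
 * Strict variant: demands a pinned image up front, even when the
 * request has completed and the image will not be touched.
 * Returns false where an assertion would have fired.
 */
static bool toy_reset_strict(struct toy_request *rq)
{
    struct toy_context *ce = rq->ctx;

    if (!toy_context_is_pinned(ce))
        return false;
    if (!rq->completed)
        ce->image_writes++;
    ce->ring_head = rq->tail;
    return true;
}

/*
 * Relaxed variant: the pin check is deferred until the image is
 * actually about to be modified. A completed request only has its
 * ring state updated, so no pin is required for it.
 */
static bool toy_reset_relaxed(struct toy_request *rq)
{
    struct toy_context *ce = rq->ctx;

    if (!rq->completed) {
        if (!toy_context_is_pinned(ce))
            return false;
        ce->image_writes++;
    }
    ce->ring_head = rq->tail;
    return true;
}
```

With an unpinned context and a completed request, toy_reset_strict() reports a failure while toy_reset_relaxed() succeeds and only moves ring_head. For an incomplete request, both variants still require the pin.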
Comment 6 Martin Peres 2019-10-16 10:47:23 UTC
(In reply to Chris Wilson from comment #5)

Thanks, this issue was seen twice on fi-tgl-u, two runs apart. Now it has not been seen in 23 runs. Looks good!
Comment 7 CI Bug Log 2019-10-16 10:47:33 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

