Bug 109626

Summary: [CI][SHARDS] igt@i915_selftest@live_workarounds - incomplete
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: high
CC: intel-gfx-bugs
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: CFL, ICL
i915 features: GEM/Other

Description Martin Peres 2019-02-13 22:22:34 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5595/fi-icl-u3/igt@i915_selftest@live_workarounds.html

<0>[  425.800821] ksoftirq-21      2d.s1 443464553us : process_csb: rcs0 cs-irq head=5, tail=1
<0>[  425.800821] ksoftirq-21      2d.s1 443464554us : process_csb: rcs0 csb[0]: status=0x10000001:0x00000000, active=0x1
<0>[  425.800821] ksoftirq-21      2d.s1 443464555us : process_csb: rcs0 csb[1]: status=0x10000018:0x00000060, active=0x5
<0>[  425.800821] ksoftirq-21      2d.s1 443464557us : process_csb: rcs0 out[0]: ctx=96.1, global=8 (fence 401c:2) (current 0:7), prio=2
<0>[  425.800821] i915_sel-4429    0.... 443464742us : i915_request_add: rcs0 fence 401b:4
<0>[  425.800821] i915_sel-4429    0.... 443464743us : i915_request_add: marking (null) as active
<0>[  425.800821] ksoftirq-21      2d.s1 443464744us : process_csb: process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
<0>[  425.800821] ---------------------------------
<4>[  425.800821] ---[ end trace 56491dea06ff360f ]---
<4>[  426.373035] RIP: 0010:process_csb+0x640/0x9a0 [i915]
<4>[  426.373035] Code: ef 56 4a e0 48 8b 35 8f 23 1b 00 49 c7 c0 b6 54 d7 a0 b9 4f 04 00 00 48 c7 c2 e0 8e d5 a0 48 c7 c7 9b ee c7 a0 e8 a0 eb 50 e0 <0f> 0b 48 c7 c1 e8 42 d9 a0 ba 29 04 00 00 48 c7 c6 e0 8e d5 a0 48
<4>[  426.373035] RSP: 0018:ffffc90000153d28 EFLAGS: 00010082
<4>[  426.373035] RAX: 000000000000000b RBX: 0000000000000002 RCX: 0000000000000000
<4>[  426.373035] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff8884ae2574e8
<4>[  426.373035] RBP: ffffc90000153d98 R08: 0000000000127878 R09: ffff8884ae398000
<4>[  426.373035] R10: ffffc90000153cb8 R11: ffff8884ae2574e8 R12: 0000000000000000
<4>[  426.373035] R13: ffff88845b4bc2a8 R14: 0000000000000001 R15: ffff888492f21040
<4>[  426.373035] FS:  0000000000000000(0000) GS:ffff8884aff00000(0000) knlGS:0000000000000000
<4>[  426.373035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  426.373035] CR2: 00007f5b2ad345a0 CR3: 00000004a8540001 CR4: 0000000000760ee0
<4>[  426.373035] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  426.373035] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[  426.373035] PKRU: 55555554
<0>[  426.373035] Kernel panic - not syncing: Fatal exception in interrupt
<0>[  426.373035] Shutting down cpus with NMI
<0>[  426.373035] Dumping ftrace buffer:
<0>[  426.373035]    (ftrace buffer empty)
<0>[  426.373035] Kernel Offset: disabled
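
For triage, the two lines that identify this specific failure are the GEM_BUG_ON in process_csb and the matching RIP. A minimal sketch of pulling them out of a dmesg dump follows; the regular expressions and the `summarize` helper are illustrative assumptions written against the excerpt above, not part of the CI tooling.

```python
import re

# Two lines copied verbatim from the dmesg excerpt above.
LOG = """\
<0>[  425.800821] ksoftirq-21      2d.s1 443464744us : process_csb: process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
<4>[  426.373035] RIP: 0010:process_csb+0x640/0x9a0 [i915]
"""

# Hypothetical patterns, fitted to the format seen in this log only.
BUG_ON = re.compile(r"(?P<func>\w+):(?P<line>\d+) GEM_BUG_ON\((?P<cond>.*)\)\s*$")
RIP = re.compile(
    r"RIP: \w+:(?P<symbol>\w+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r" \[(?P<module>\w+)\]"
)


def summarize(log):
    """Return the failed assertion and faulting symbol found in a dmesg dump."""
    summary = {}
    for line in log.splitlines():
        m = BUG_ON.search(line)
        if m:
            summary["assertion"] = (m["func"], int(m["line"]), m["cond"])
        m = RIP.search(line)
        if m:
            summary["rip"] = (m["module"], m["symbol"], int(m["offset"], 16))
    return summary


print(summarize(LOG))
```

Matching on both the assertion text and the faulting symbol is one way to tell this signature apart from other "incomplete" results of the same test.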
Comment 1 CI Bug Log 2019-02-13 22:23:14 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* CFL ICL: igt@i915_selftest@live_workarounds - incomplete
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3590/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3592/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3592/fi-cfl-8700k/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3596/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3596/fi-cfl-guc/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3596/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3596/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3617/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3617/fi-cfl-8700k/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3617/fi-cfl-guc/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3617/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3617/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12200/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2380/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2380/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5594/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5595/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12209/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2391/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4824/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12213/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12213/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3845/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2387/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3848/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3848/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5598/fi-cfl-8109u/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5598/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2394/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5599/fi-icl-u2/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5599/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4825/fi-icl-u3/igt@i915_selftest@live_workarounds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2396/fi-icl-u3/igt@i915_selftest@live_workarounds.html
Comment 2 Chris Wilson 2019-02-13 22:27:51 UTC
(In reply to CI Bug Log from comment #1)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * CFL ICL: igt@i915_selftest@live_workarounds - incomplete
...

Only a few of those are this very specific bug.
Comment 3 Chris Wilson 2019-02-15 09:54:26 UTC
commit c836eb79c033c2be13aa8b41729b28d2ab1f72ab (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 13 22:48:05 2019 +0000

    drm/i915/selftests: Always use an active engine while resetting
    
    Currently, we only try to reset a live engine for checking the whitelist
    retention across a per-engine reset. For safety, it appears we need to
    prime the system with a hanging spinner before performing a full-device
    reset. (Figuring out the root cause behind the instability with handling
    a reset during a no-op request is a challenge for another test, the
    whitelist test has its own purpose.)
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109626
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190213224805.32021-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

should prevent it from occurring, and CI hints that

commit 9a3b19a16dc28ab717cf1663d09ffee0715b735a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 13 23:20:47 2019 +0000

    drm/i915: Only try to park engines after a failed reset
    
    Currently we try to stop the engine by programming the ring registers to
    be disabled before we perform the reset. Sometimes, we see the context
    image also have invalid ring registers, which one presumes may be
    actually caused by us doing so. Lets risk not doing programming the
    ring to zero on the first attempt to avoid preserving that corruption
    into the context image, leaving the w/a in place for subsequent
    reset attempts.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190213232047.8486-1-chris@chris-wilson.co.uk

might be the real deal.
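
As a rough illustration of the idea in the second commit message (skip stopping the engines on the first reset attempt, keep the workaround for later attempts), here is a toy model in Python. It is not the i915 code: `reset_with_retries`, `do_reset`, `park_engines` and the retry count are all hypothetical names and values chosen for the sketch.

```python
def reset_with_retries(do_reset, park_engines, retries=3):
    """Attempt a reset up to `retries` times.

    Engines are parked only before a retry, never before the first
    attempt. Returns the number of attempts used, or None if all failed.
    """
    for attempt in range(retries):
        if attempt > 0:
            # A previous attempt failed: apply the workaround of stopping
            # the engines before trying again.
            park_engines()
        if do_reset():
            return attempt + 1
    return None


# Simulated hardware: the first reset fails, the second succeeds.
outcomes = iter([False, True])
parked = []
attempts = reset_with_retries(lambda: next(outcomes), lambda: parked.append(True))
print(attempts, len(parked))
```

In this model a reset that succeeds at the first attempt never touches the ring registers, which is the behaviour the commit message argues for.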
Comment 4 Martin Peres 2019-03-06 15:44:20 UTC
(In reply to Chris Wilson from comment #3)

No idea if this was what fixed it, but it most definitely is fixed! Thanks!
Comment 5 CI Bug Log 2019-03-06 15:45:30 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.
