Bug 105358

Summary: [CI] igt@gem_eio@in-flight* - incomplete - i915_request_retire:390 GEM_BUG_ON(!i915_request_completed(request))
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: normal
Priority: medium
Version: DRI git
Hardware: Other
OS: All
Reporter: Marta Löfstedt <marta.lofstedt>
Assignee: Marta Löfstedt <marta.lofstedt>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=105341
Whiteboard: ReadyForDev
i915 platform: KBL
i915 features: GEM/Other

Description Marta Löfstedt 2018-03-06 07:36:27 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3875/shard-kbl4/igt@gem_eio@in-flight-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3875/shard-kbl7/igt@gem_eio@in-flight.html

From pstore:
<3>[   23.739978] i915_request_retire:390 GEM_BUG_ON(!i915_request_completed(request))
<4>[   23.740062] ------------[ cut here ]------------
<2>[   23.740064] kernel BUG at drivers/gpu/drm/i915/i915_request.c:390!
<4>[   23.740088] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<0>[   23.740095] Dumping ftrace buffer:
Comment 1 Chris Wilson 2018-03-06 08:01:32 UTC
Same basic problem as bug 105341
Comment 2 Marta Löfstedt 2018-03-06 08:04:26 UTC
(In reply to Chris Wilson from comment #1)
> Same basic problem as bug 105341

Yeah, I think so too, but since the backtrace looks so different I think it is better not to dup the bugs, right?
Comment 3 Chris Wilson 2018-03-06 08:32:50 UTC
To actually reconstruct the original bug where the guc was executing contexts out of order, you have to disable trickle feeding the guc.
Comment 4 Chris Wilson 2018-03-06 08:33:13 UTC
Oops, wrong bug.
Comment 5 Chris Wilson 2018-03-16 10:18:53 UTC
commit ac697ae8013a7c7301174c9c3b02a92fe418b7ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 15 15:10:15 2018 +0000

    drm/i915: Stop engines when declaring the machine wedged
    
    If we fail to reset the GPU, we declare the machine wedged. However, the
    GPU may well still be running in the background with an in-flight
    request. So despite our efforts in cleaning up the request queue and
    faking the breadcrumb in the HWSP, the GPU may eventually write the
    in-flight seqno there breaking all of our assumptions and throwing the
    driver into a deep turmoil, wedging beyond wedged.
    
    To avoid this we ideally want to reset the GPU. Since that has already
    failed, make sure the rings have the stop bit set instead. This is part
    of the normal GPU reset sequence, but that is actually disabled by
    igt/gem_eio to force the wedged state. If we assume the worst, we must
    poke at the bit again before we give up.
    
    v2: Move the intel_gpu_reset() from set-wedged in the reset error path
    into i915_gem_set_wedged() itself. Even if the reset fails (e.g. if it is
    disabled by gem_eio), it still tries to make sure the engines are
    stopped. For i915_gem_set_wedged() callers from outside of i915_reset(),
    this should make sure the GPU is disabled while the driver is marked as
    being wedged.
    
    Testcase: igt/gem_eio
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180315151015.22741-1-chris@chris-wilson.co.uk
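
The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_engine`, `toy_set_wedged`, `toy_hw_finish_inflight` and `toy_wedge_then_run` are hypothetical names invented for illustration, not i915 functions, and the seqno values are made up. It models a driver that fakes the breadcrumb when wedging, and an engine that, unless its stop bit is set, later writes the breadcrumb of its in-flight request over the faked value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these names are hypothetical, not the i915 API. */
struct toy_engine {
    bool stop_bit;           /* stands in for the ring stop bit */
    uint32_t inflight_seqno; /* request still executing on hardware; 0 = idle */
    uint32_t hwsp_seqno;     /* breadcrumb the driver reads to judge completion */
};

/* Hardware side: an engine that was not stopped eventually writes its breadcrumb. */
static void toy_hw_finish_inflight(struct toy_engine *e)
{
    assert(e != NULL);
    if (e->stop_bit || e->inflight_seqno == 0)
        return;
    e->hwsp_seqno = e->inflight_seqno;
    e->inflight_seqno = 0;
}

/* Driver side: declare the device wedged after a failed reset. */
static void toy_set_wedged(struct toy_engine *e, uint32_t last_seqno,
                           bool stop_engines)
{
    assert(e != NULL);
    if (stop_engines)
        e->stop_bit = true;     /* the extra step the commit describes */
    e->hwsp_seqno = last_seqno; /* fake the breadcrumb: everything "complete" */
}

/*
 * Wedge with request 5 in flight and 7 as the last submitted request,
 * then let the hardware run. Returns the breadcrumb the driver sees.
 */
uint32_t toy_wedge_then_run(bool stop_engines)
{
    struct toy_engine e = {
        .stop_bit = false,
        .inflight_seqno = 5,
        .hwsp_seqno = 4,
    };

    toy_set_wedged(&e, 7, stop_engines);
    toy_hw_finish_inflight(&e);
    return e.hwsp_seqno;
}
```

In this model, `toy_wedge_then_run(false)` leaves the breadcrumb at 5, behind the faked value of 7, so a request the driver already treated as complete no longer looks complete. That is the kind of inconsistency the GEM_BUG_ON in i915_request_retire reports. With `toy_wedge_then_run(true)` the breadcrumb stays at 7.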
