Bug 110998 - [gen9] Hang recovery fails for atomic+textureBuffer hang
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Duplicates: 109020
Depends on:
Blocks:
 
Reported: 2019-06-25 18:50 UTC by Jason Ekstrand
Modified: 2019-09-12 20:11 UTC

See Also:
i915 platform: CFL, KBL, SKL
i915 features: GPU hang


Description Jason Ekstrand 2019-06-25 18:50:35 UTC
In https://bugs.freedesktop.org/show_bug.cgi?id=110228 we have a reported bug where the GPU hangs in certain L3$ atomic + textureBuffer cases.  The fact that it's hanging is most likely a userspace mesa bug.  However, I have never seen a system come back from this hang on any kernel I've tried it on in the last two years.  (The bug has only been open for 6 months but I've experienced the exact same hang as far back as two years ago.)  The fact that it's hard-hanging is definitely a kernel bug.

Steps to reproduce:

 1. Install any version of the Linux Vulkan driver (we've had this hang since about forever so a moderately recent distro driver should work.)
 2. Download reproducer_v2 from #110228
 3. Build with cmake
 4. Run ./cs_hang 512
 5. Watch it burn

This hang is present on at least SKL, KBL, and CFL but probably also APL.  It is not present on BDW or ICL so it can't be tested there (though the recovery path might still have a bug on those platforms).
Comment 1 Chris Wilson 2019-06-25 22:20:55 UTC
Looks like another situation where asking for a GPU reset turns into a request for a power cycle.
Comment 2 Chris Wilson 2019-06-26 09:15:53 UTC
It doesn't make any difference whether we ask for a full reset or rcs0 reset; with or without rings disabled; it dies a few milliseconds after asserting the GDRST. At the moment, the only way out is not to fall in.
Comment 3 Chris Wilson 2019-06-26 10:57:28 UTC
Fwiw, the GPU hang does not present itself on bxt. Just an issue in skl and its derivatives, already fixed in icl.
Comment 4 Chris Wilson 2019-07-20 12:48:25 UTC
Disable atomics in L3; no hang (kbl).

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 704ace01e7f5..890a3bcfacea 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
                            MMCD_PCLA | MMCD_HOTSPOT_EN);
        }
 
+       wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+       wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+       wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
+
        /* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
        wa_write_or(wal,
                    GAM_ECOCHK,
Comment 5 Jason Ekstrand 2019-07-20 12:58:54 UTC
I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't realized you could disable L3$ for just atomics that easily).  This bug is about the fact that the kernel fails to recover.
Comment 6 Chris Wilson 2019-07-20 13:10:49 UTC
(In reply to Jason Ekstrand from comment #5)
> I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't
> realized you could disable L3$ for just atomics that easily).  This bug is
> about the fact that the kernel fails to recover.

The HW dies.
Comment 7 Chris Wilson 2019-08-08 16:37:28 UTC
Courtesy of Jason, https://patchwork.freedesktop.org/series/64920/
Comment 8 Denis 2019-09-04 09:29:57 UTC
Hi Jason, Chris, I tested this patch with one more game: https://github.com/doitsujin/dxvk/issues/794
It fixed the hang there as well. Is there any reason not to push this patch?
Comment 9 Chris Wilson 2019-09-04 10:04:55 UTC
(In reply to Denis from comment #8)
> Hi Jason, Chris, I tested this patch with one more game:
> https://github.com/doitsujin/dxvk/issues/794
> It fixed the hang there as well. Is there any reason not to push this patch?

We were waiting for confirmation that it helped UE4 and not just piglit. Thanks.
Comment 10 Chris Wilson 2019-09-04 11:50:32 UTC
Hang and subsequent death avoided by

commit 9d7b01e93526efe79dbf75b69cc5972b5a4f7b37 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 4 11:07:07 2019 +0100

    drm/i915: Restore relaxed padding (OCL_OOB_SUPPRES_ENABLE) for skl+
    
    This bit was fliped on for "syncing dependencies between camera and
    graphics". BSpec has no recollection why, and it is causing
    unrecoverable GPU hangs with Vulkan compute workloads.
    
    From BSpec, setting bit5 to 0 enables relaxed padding requirements for
    buffers, 1D and 2D non-array, non-MSAA, non-mip-mapped linear surfaces;
    and *must* be set to 0h on skl+ to ensure "Out of Bounds" case is
    suppressed.
    
    Reported-by: Jason Ekstrand <jason@jlekstrand.net>
    Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
    Fixes: 8424171e135c ("drm/i915/gen9: h/w w/a: syncing dependencies between camera and graphics")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: denys.kostin@globallogic.com
    Cc: Jason Ekstrand <jason@jlekstrand.net>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v4.1+
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190904100707.7377-1-chris@chris-wilson.co.uk

Solves the immediate test case.
Comment 11 Jonathan Farrugia 2019-09-12 20:11:04 UTC
*** Bug 109020 has been marked as a duplicate of this bug. ***

