Bug 110998 - [gen9] Hang recovery fails for atomic+textureBuffer hang
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-25 18:50 UTC by Jason Ekstrand
Modified: 2019-08-08 16:37 UTC

See Also:
i915 platform: CFL, KBL, SKL
i915 features: GPU hang


Description Jason Ekstrand 2019-06-25 18:50:35 UTC
In https://bugs.freedesktop.org/show_bug.cgi?id=110228 we have a reported bug where the GPU hangs in certain L3$ atomic + textureBuffer cases.  The fact that it's hanging is most likely a userspace mesa bug.  However, I have never seen a system come back from this hang on any kernel I've tried it on in the last two years.  (The bug has only been open for 6 months but I've experienced the exact same hang as far back as two years ago.)  The fact that it's hard-hanging is definitely a kernel bug.

Steps to reproduce:

 1. Install any version of the Linux Vulkan driver (we've had this hang since about forever so a moderately recent distro driver should work.)
 2. Download reproducer_v2 from #110228
 3. Build with cmake
 4. Run ./cs_hang 512
 5. Watch it burn

This hang is present on at least SKL, KBL, and CFL but probably also APL.  It is not present on BDW or ICL so it can't be tested there (though the recovery path might still have a bug on those platforms).
Comment 1 Chris Wilson 2019-06-25 22:20:55 UTC
Looks like another situation where asking for a GPU reset turns into a request for a power cycle.
Comment 2 Chris Wilson 2019-06-26 09:15:53 UTC
It doesn't make any difference whether we ask for a full reset or rcs0 reset; with or without rings disabled; it dies a few milliseconds after asserting the GDRST. At the moment, the only way out is not to fall in.
Comment 3 Chris Wilson 2019-06-26 10:57:28 UTC
Fwiw, the GPU hang does not present itself on bxt. Just an issue in skl and its derivatives, already fixed in icl.
Comment 4 Chris Wilson 2019-07-20 12:48:25 UTC
Disable atomics in L3; no hang (kbl).

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 704ace01e7f5..890a3bcfacea 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
                            MMCD_PCLA | MMCD_HOTSPOT_EN);
        }
 
+       wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+       wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+       wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
+
        /* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
        wa_write_or(wal,
                    GAM_ECOCHK,
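
For readers unfamiliar with the helper used in the patch above, here is a minimal sketch of what a masked read-modify-write does to a register value. It assumes that wa_write_masked_or(wal, reg, mask, val) means "clear the bits in mask, then OR in val"; that reading is an assumption about the helper, not something stated in this bug. The function name masked_or and the BIT macro below are illustrative stand-ins that operate on plain integers, not the i915 API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's BIT() macro. */
#define BIT(n) (UINT32_C(1) << (n))

/*
 * Hypothetical model of a masked read-modify-write on a 32-bit register
 * value: clear every bit set in `mask`, then OR in `val`.
 * With val == 0, as in the patch above, the masked bits end up cleared
 * and all other bits keep their previous value.
 */
static uint32_t masked_or(uint32_t old, uint32_t mask, uint32_t val)
{
	return (old & ~mask) | val;
}
```

Under that assumption, the three added lines clear bit 0 of 0xb008, bit 22 of 0xb118 and bit 8 of 0xb11c, leaving every other bit in those registers untouched. For example, masked_or(0xffffffff, BIT(22), 0) yields 0xffbfffff.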
Comment 5 Jason Ekstrand 2019-07-20 12:58:54 UTC
I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't realized you could disable L3$ for just atomics that easily).  This bug is about the fact that the kernel fails to recover.
Comment 6 Chris Wilson 2019-07-20 13:10:49 UTC
(In reply to Jason Ekstrand from comment #5)
> I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't
> realized you could disable L3$ for just atomics that easily).  This bug is
> about the fact that the kernel fails to recover.

The HW dies.
Comment 7 Chris Wilson 2019-08-08 16:37:28 UTC
Courtesy of Jason, https://patchwork.freedesktop.org/series/64920/
