In https://bugs.freedesktop.org/show_bug.cgi?id=110228 we have a reported bug where the GPU hangs in certain L3$ atomic + textureBuffer cases. The fact that it's hanging is most likely a userspace Mesa bug. However, I have never seen a system come back from this hang on any kernel I've tried it on in the last two years. (The bug has only been open for 6 months, but I've experienced the exact same hang as far back as two years ago.) The fact that it's hard-hanging is definitely a kernel bug.
Steps to reproduce:
1. Install any version of the Linux Vulkan driver (we've had this hang since about forever so a moderately recent distro driver should work.)
2. Download reproducer_v2 from #110228
3. Build with cmake
4. Run ./cs_hang 512
5. Watch it burn
This hang is present on at least SKL, KBL, and CFL but probably also APL. It is not present on BDW or ICL so it can't be tested there (though the recovery path might still have a bug on those platforms).
Looks like another situation where asking for a GPU reset turns into a request for a power cycle.
It doesn't make any difference whether we ask for a full reset or rcs0 reset; with or without rings disabled; it dies a few milliseconds after asserting the GDRST. At the moment, the only way out is not to fall in.
Fwiw, the GPU hang does not present itself on bxt. Just an issue in skl and its derivatives, already fixed in icl.
Disable atomics in L3; no hang (kbl).
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/g
index 704ace01e7f5..890a3bcfacea 100644
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
MMCD_PCLA | MMCD_HOTSPOT_EN);
+ wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+ wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+ wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
/* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
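For anyone reading along, here is a rough sketch of what each added line is doing, assuming wa_write_masked_or() has the usual read-modify-write semantics (clear the bits in the mask, then OR in the value). With a value of 0, each call just clears one bit. The masked_or() helper and the BIT() definition below are hypothetical stand-ins operating on a plain integer, not the driver's code, and no real register is touched.

```c
#include <stdint.h>

/* Local definition for illustration; the kernel has its own BIT() macro. */
#define BIT(n) (UINT32_C(1) << (n))

/*
 * Hypothetical model of a masked read-modify-write on a 32-bit register
 * value: clear every bit set in `mask`, then OR in `val`.
 * Assumption: this mirrors how the workaround list applies an entry.
 */
static uint32_t masked_or(uint32_t old, uint32_t mask, uint32_t val)
{
	return (old & ~mask) | val;
}
```

Under that assumption, masked_or(old, BIT(22), 0) returns old with bit 22 cleared and every other bit unchanged, which is the shape of the three writes in the patch.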
I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't realized you could disable L3$ for just atomics that easily). This bug is about the fact that the kernel fails to recover.
(In reply to Jason Ekstrand from comment #5)
> I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't
> realized you could disable L3$ for just atomics that easily). This bug is
> about the fact that the kernel fails to recover.
The HW dies.
Courtesy of Jason, https://patchwork.freedesktop.org/series/64920/