In https://bugs.freedesktop.org/show_bug.cgi?id=110228 we have a reported bug where the GPU hangs in certain L3$ atomic + textureBuffer cases. The fact that it's hanging is most likely a userspace Mesa bug. However, I have never seen a system come back from this hang on any kernel I've tried it on in the last two years. (The bug has only been open for 6 months, but I've experienced the exact same hang as far back as two years ago.) The fact that it's hard-hanging is definitely a kernel bug.
Steps to reproduce:
1. Install any version of the Linux Vulkan driver (we've had this hang since about forever so a moderately recent distro driver should work.)
2. Download reproducer_v2 from #110228
3. Build with cmake
4. Run ./cs_hang 512
5. Watch it burn
This hang is present on at least SKL, KBL, and CFL but probably also APL. It is not present on BDW or ICL so it can't be tested there (though the recovery path might still have a bug on those platforms).
Looks like another situation where asking for a GPU reset turns into a request for a power cycle.
It doesn't make any difference whether we ask for a full reset or rcs0 reset; with or without rings disabled; it dies a few milliseconds after asserting the GDRST. At the moment, the only way out is not to fall in.
Fwiw, the GPU hang does not present itself on bxt. Just an issue in skl and its derivatives, already fixed in icl.
Disable atomics in L3; no hang (kbl).
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/g
index 704ace01e7f5..890a3bcfacea 100644
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
MMCD_PCLA | MMCD_HOTSPOT_EN);
+ wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+ wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+ wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
/* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
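For readers unfamiliar with the helper, each added line in the diff above passes a one-bit mask and a value of 0, i.e. it asks for that single bit to be cleared while the rest of the register is left alone. A minimal userspace sketch of that read-modify-write, assuming the usual "clear the mask bits, then OR in the value" semantics; masked_or() is a hypothetical stand-in operating on a plain integer, not the kernel function, and the starting value is made up:

```c
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/*
 * Hypothetical model of a masked read-modify-write on a 32-bit register
 * value: clear every bit set in `clear`, then OR in `set`.
 * This only models the arithmetic; it touches no hardware.
 */
static uint32_t masked_or(uint32_t old, uint32_t clear, uint32_t set)
{
	return (old & ~clear) | set;
}

/*
 * Apply the three single-bit clears from the diff to made-up register
 * contents. regs[0], regs[1], regs[2] stand in for offsets 0xb008,
 * 0xb118 and 0xb11c respectively.
 */
static void clear_l3_atomic_bits(uint32_t regs[3])
{
	regs[0] = masked_or(regs[0], BIT(0), 0);  /* 0xb008, bit 0  */
	regs[1] = masked_or(regs[1], BIT(22), 0); /* 0xb118, bit 22 */
	regs[2] = masked_or(regs[2], BIT(8), 0);  /* 0xb11c, bit 8  */
}
```

Starting from all-ones, only the named bit changes in each value; every other bit is preserved.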
I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't realized you could disable L3$ for just atomics that easily). This bug is about the fact that the kernel fails to recover.
(In reply to Jason Ekstrand from comment #5)
> I'm aware that disabling L3$ for atomics fixes the hang (though I hadn't
> realized you could disable L3$ for just atomics that easily). This bug is
> about the fact that the kernel fails to recover.
The HW dies.
Courtesy of Jason, https://patchwork.freedesktop.org/series/64920/
Hi Jason, Chris, I tested this patch with one more game (https://github.com/doitsujin/dxvk/issues/794),
and it also fixed the hang. Is there any reason not to push this patch?
(In reply to Denis from comment #8)
> Hi Jason, Chris, I tested this patch with one more game
> And it also fixed the hang. Is there any reason not to push this patch?
We were waiting for confirmation that it helped UE4 and not just piglit. Thanks.
Hang and subsequent death avoided by
commit 9d7b01e93526efe79dbf75b69cc5972b5a4f7b37 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <firstname.lastname@example.org>
Date: Wed Sep 4 11:07:07 2019 +0100
drm/i915: Restore relaxed padding (OCL_OOB_SUPPRES_ENABLE) for skl+
This bit was flipped on for "syncing dependencies between camera and
graphics". BSpec has no recollection why, and it is causing
unrecoverable GPU hangs with Vulkan compute workloads.
From BSpec, setting bit5 to 0 enables relaxed padding requirements for
buffers, 1D and 2D non-array, non-MSAA, non-mip-mapped linear surfaces;
and *must* be set to 0h on skl+ to ensure "Out of Bounds" case is
suppressed.
Reported-by: Jason Ekstrand <email@example.com>
Suggested-by: Jason Ekstrand <firstname.lastname@example.org>
Fixes: 8424171e135c ("drm/i915/gen9: h/w w/a: syncing dependencies between camera and graphics")
Signed-off-by: Chris Wilson <email@example.com>
Cc: Jason Ekstrand <firstname.lastname@example.org>
Cc: Mika Kuoppala <email@example.com>
Cc: <firstname.lastname@example.org> # v4.1+
Reviewed-by: Mika Kuoppala <email@example.com>
Solves the immediate test case.
*** Bug 109020 has been marked as a duplicate of this bug. ***