108820 – [SKL] (recoverable) GPU hangs in benchmarks using compute shaders

Summary: [SKL] (recoverable) GPU hangs in benchmarks using compute shaders

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:	regression

Depends on:
Blocks:	mesa-19.0

Reported:	2018-11-21 11:34 UTC by Eero Tamminen
Modified:	2019-08-12 14:04 UTC
CC List:	2 users

See Also:	103556
i915 platform:
i915 features:

Attachments
CarChase GPU hang (119.46 KB, text/plain) 2019-01-03 10:52 UTC, Eero Tamminen
INTEL_DEBUG=cs,do32 output for the 65 local workgroup size case (GPU hang) (202.14 KB, text/plain) 2019-01-09 21:51 UTC, Jakub Okoński
INTEL_DEBUG=cs,do32 output for the 64 local workgroup size case (working) (201.50 KB, text/plain) 2019-01-09 21:51 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 65 items (failing) (149.49 KB, text/plain) 2019-01-10 15:36 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 64 items (working) (149.03 KB, text/plain) 2019-01-10 15:36 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs for the minimal repro shader (8.48 KB, text/plain) 2019-01-10 18:25 UTC, Jakub Okoński
SKL GT3e CarChase GPU hang (119.65 KB, text/plain) 2019-02-05 11:10 UTC, Eero Tamminen
/sys/class/drm/card0/error Jakub (48.00 KB, text/plain) 2019-02-05 13:02 UTC, Jakub Okoński
SKL GT3e CarChase GPU hang (mesa: b3aa37046b) (123.65 KB, text/plain) 2019-03-20 13:04 UTC, Eero Tamminen

Description Eero Tamminen 2018-11-21 11:34:02 UTC

Setup:
* SKL GT2 / GT3e
* Ubuntu 18.04
* *drm-tip* v4.19 kernel
* Mesa & X git head

Test-case:
* Run a test-case using compute shaders

Expected output:
* No GPU hangs (like with earlier Mesa commits)

Actual output:
* Recoverable GPU hangs in compute shader using test-cases:
  - GfxBench Aztec Ruins, CarChase and Manhattan 3.1
  - Sascha Willems' Vulkan compute demos
  - SynMark CSDof / CSCloth
* Vulkan compute demos fail to run (other tests run successfully despite hangs)

This seems to be SKL specific, it's not visible on other HW.

This regression happened between following Mesa commits:
* dca35c598d: 2018-11-19 15:57:41: intel/fs,vec4: Fix a compiler warning
* a999798daa: 2018-11-20 17:09:22: meson: Add tests to suites

It also seems to be specific to *drm-tip* v4.19.0 kernel as I don't see it with latest drm-tip v4.20.0-rc3 kernel.  So it's also possible that it's a bug in i915, that just gets triggered by Mesa change, and which got fixed later.


Sascha Willems' Vulkan Raytracing demo outputs the following on the first run:
---------------------------------
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    14920 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    10300 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 269 vs 256
    10944 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    11920 bytes into the SPIR-V binary
INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
vulkan_raytracing: base/vulkanexamplebase.cpp:651: void VulkanExampleBase::submitFrame(): Assertion `res == VK_SUCCESS' failed.
-----------------------------

(Other runs show just the error and assert.)

Comment 1 Mark Janes 2018-11-21 18:56:22 UTC

Since this bug is limited to a drm-tip kernel, it seems likely that the problem is in the kernel, not in mesa.  Can you reproduce it on any released kernel?

Comment 2 Eero Tamminen 2018-11-28 14:41:51 UTC

(In reply to Eero Tamminen from comment #0)
> It also seems to be specific to *drm-tip* v4.19.0 kernel as I don't see it
> with latest drm-tip v4.20.0-rc3 kernel.  So it's also possible that it's a
> bug in i915, that just gets triggered by Mesa change, and which got fixed
> later.

I've now seen hangs also with drm-tip v4.20.0-rc3 kernel.


However, these GPU hangs don't happen anymore with this or later Mesa commit
(regardless of whether they're with v4.19 or v4.20-rc4 drm-tip kernels):
3c96a1e3a97ba 2018-11-26 08-29-39: radv: Fix opaque metadata descriptor last layer

-> FIXED?

(I'm lacking data for several previous days, so I can't give an exact time when those hangs stopped.)

The Raytracing demo SPIR-V warnings still happen, although I updated Sascha Willems' demos to the latest Git version.

Comment 3 Eero Tamminen 2018-11-30 16:47:18 UTC

Sorry, all the hangs have happened with drm-tip v4.20-rc versions, not v4.19.

Last night there were again recoverable hangs on SKL, with drm-tip v4.20-rc4:
* GfxBench v5-GOLD2 Aztec Ruins GL & Vulkan ("normal") versions
* Unigine Heaven v4.0
* SynMark v7 CSCloth

Heaven doesn't use compute shaders, so maybe the issue isn't compute related after all.

Comment 4 Jakub Okoński 2018-12-25 20:41:28 UTC

It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am able to reproduce it in my own vulkan app. If I don't dispatch any compute work, everything works fine. As soon as I submit a CB with dispatches, I get that same error:

INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)

Comment 5 Jakub Okoński 2018-12-25 20:44:00 UTC

I should clarify: when I said "as soon as I submit", I mean the vkQueueSubmit call exits with device lost error. Before it returns, my desktop freezes for a couple seconds (maybe I can move my mouse, but it doesn't render the new position while hung).

Comment 6 Lionel Landwerlin 2019-01-02 14:55:03 UTC

Could you attach the /sys/class/drm/card0/error file after you notice a hang?
Thanks!

Comment 7 Eero Tamminen 2019-01-03 10:52:23 UTC

Created attachment 142952 [details]
CarChase GPU hang

I'm now pretty sure it's drm-tip kernel issue.  It went away after v4.20-rc4 kernel version, at end of November, but still happens when using our last v4.19 drm-tip build (currently used in our Mesa tracking).

It seems to happen more frequently with SKL GT3e than GT2.  Attached is error state for GfxBench CarChase hang with Mesa (8c93ef5de98a9) from couple of days ago.

I've now updated our Mesa tracking to use v4.20 drm-tip build, I'll tell next week whether that helped (as expected).

Comment 8 Lionel Landwerlin 2019-01-03 12:25:15 UTC

I know this is going to be painful, but it would be really good to have a bisect on what commit broke this...
Skimming through the logs, I couldn't find anything between drm-tip/4.18-rc7 and drm-tip/4.20-rc4 that indicates a hang of this kind on gen9.

A bit later (4th of December) this fix appeared, which could have an impact:

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds
    
    We stopped re-applying the GT workarounds after engine reset since commit
    59b449d5c82a ("drm/i915: Split out functions for different kinds of
    workarounds").
    
    Issue with this is that some of the GT workarounds live in the MMIO space
    which gets lost during engine resets. So far the registers in 0x2xxx and
    0xbxxx address range have been identified to be affected.
    
    This losing of applied workarounds has obvious negative effects and can
    even lead to hard system hangs (see the linked Bugzilla).
    
    Rather than just restoring this re-application, because we have also
    observed that it is not safe to just re-write all GT workarounds after
    engine resets (GPU might be live and weird hardware states can happen),
    we introduce a new class of per-engine workarounds and move only the
    affected GT workarounds over.
    
    Using the framework introduced in the previous patch, we therefore after
    engine reset, re-apply only the workarounds living in the affected MMIO
    address ranges.
    
    v2:
     * Move Wa_1406609255:icl to engine workarounds as well.
     * Rename API. (Chris Wilson)
     * Drop redundant IS_KABYLAKE. (Chris Wilson)
     * Re-order engine wa/ init so latest platforms are first. (Rodrigo Vivi)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugzilla.freedesktop.org/show_bug.cgi?id=107945
    Fixes: 59b449d5c82a ("drm/i915: Split out functions for different kinds of workarounds")

Comment 9 Eero Tamminen 2019-01-07 10:03:30 UTC

Still seeing the hangs with latest Mesa and drm-tip 4.20 kernel, on SKL GT3e & GT4e.

It happens approximately on 1 out of 3 runs.

Seems to happen only with:
* Aztec Ruins (normal, FullHD resolution)
* Carchase, but only when it's run in 4K resolution (not in FullHD):
  testfw_app --gfx glfw --gl_api desktop_core --width 3840 --height 2160 --fullscreen 0 --test_id gl_4

Comment 10 Eero Tamminen 2019-01-08 09:20:23 UTC

(In reply to Jakub Okoński from comment #4)
> It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am
> able to reproduce it in my own vulkan app. If I don't dispatch any compute
> work, everything works fine. As soon as I submit a CB with dispatches, I get
> that same error:
> 
> INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU
> hung on one of our command buffers (VK_ERROR_DEVICE_LOST)

Is this fully reproducible?  If yes, could you either attach your test-case, or (preferably :)) try bisecting it from Mesa?

As can be seen from above comments, this isn't reproducible enough in my test-cases that I could reliably bisect it (even whether it's Mesa or kernel issue).
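
A rough way to see why an intermittent hang is hard to bisect: if a hang shows up in about 1 out of 3 runs (the rate from comment 9), a single clean run says very little. The sketch below estimates how many clean runs in a row are needed before calling a build "good". It assumes runs are independent and the hang rate is constant, neither of which the thread establishes; `runs_needed` is a hypothetical helper name.

```python
import math


def runs_needed(hang_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of consecutive clean runs after which the chance of
    having simply missed a still-present hang is at most 1 - confidence.

    Assumption: every run hangs independently with probability hang_rate.
    """
    if not 0.0 < hang_rate < 1.0:
        raise ValueError("hang_rate must be strictly between 0 and 1")
    miss_probability = 1.0 - confidence
    return math.ceil(math.log(miss_probability) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    # Hang rate of roughly 1 in 3 runs, as reported in comment 9.
    print(runs_needed(1 / 3))  # prints 8
```

With a 1-in-3 rate this gives 8 clean runs per tested commit, which multiplies the cost of every bisect step.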

Comment 11 Jakub Okoński 2019-01-08 20:33:50 UTC

I don't know about these other applications, I've been experiencing this issue in my own vulkan app. I had time today to mess around a bit more. First I removed my dispatches and it worked fine, then I brought them back and started simplifying my compute shader.

So far, I've been able to isolate the issue to a single `barrier()` GLSL call near the end of my shader. I have another barrier earlier - `memoryBarrierShared()` and it doesn't cause any issues. Perhaps this is isolated to control flow barriers in compute shaders?

I am preparing my code to serve as a repro case, I should have it soon, but I use a Rust toolchain so it might not be the easiest.

Comment 12 Jakub Okoński 2019-01-08 20:40:53 UTC

OK, I think I found the precise issue. It occurs when using a control flow barrier in the shader with more than 64 items in the workgroup.

To put in concrete terms:

Shader A:
...
layout (local_size_x = 64) in;
void main() {
    // code
    barrier();
}

----
Works fine.

Shader B:
...
layout (local_size_x = 65) in;
void main() {
    // code
    barrier();
}

----
Hangs with INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST).


Shader C:
...
layout (local_size_x = 65) in;
void main() {
    // code
    // barrier(); without any control flow barriers inside
}
----
Works fine as well.

This should be enough to zoom into the issue, but if you need code you can execute and repro locally, let me know and I can deliver it.

Comment 13 Jakub Okoński 2019-01-08 20:50:00 UTC

Vulkan spec defines a minimum of 128 items in the first dimension of a workgroup, the driver reports maxComputeWorkGroupSize[0] = 896 so I think my application is well behaved in this case and should not hang because of limits.

Comment 14 Eero Tamminen 2019-01-09 12:24:12 UTC

(In reply to Jakub Okoński from comment #12)
> OK, I think I found the precise issue. It occurs when using a control flow
> barrier in the shader with more than 64 items in the workgroup.

Great, thanks!


(In reply to Jakub Okoński from comment #13)
> Vulkan spec defines a minimum of 128 items in the first dimension of a
> workgroup, the driver reports maxComputeWorkGroupSize[0] = 896 so I think my
> application is well behaved in this case and should not hang because of
> limits.

At certain workgroup size thresholds, at least the used SIMD mode can increase (threshold differs between platforms).

You can check whether there's a change in that between working and non-working cases with something like this:
  INTEL_DEBUG=cs <your test-case>  2> shader-ir-asm.txt
  grep ^SIMD shader-ir-asm.txt

If SIMD mode doesn't change, what's the diff between the shader IR/ASM output of the two versions?
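
The `grep ^SIMD` check above can also be done on saved dumps with a small script. This is only a sketch: it assumes the summary line looks like the `SIMD32 shader: 496 instructions. ...` lines that appear later in this thread, and the function names are made up for illustration.

```python
import re

# Matches summary lines such as:
#   SIMD32 shader: 496 instructions. 0 loops. 19586 cycles. ...
SIMD_LINE = re.compile(r"^SIMD(\d+) shader: (\d+) instructions\.")


def simd_summary(dump_text):
    """Return a list of (simd_width, instruction_count), one per summary line."""
    found = []
    for line in dump_text.splitlines():
        match = SIMD_LINE.match(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


def simd_width_changed(working_dump, failing_dump):
    """True if the two dumps do not use the same sequence of SIMD widths."""
    working_widths = [width for width, _ in simd_summary(working_dump)]
    failing_widths = [width for width, _ in simd_summary(failing_dump)]
    return working_widths != failing_widths
```

Feed it the contents of the two `shader-ir-asm.txt` files (working and hanging case); if `simd_width_changed` returns False, the next step is a plain diff of the assembly.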

Comment 15 Jakub Okoński 2019-01-09 21:50:13 UTC

I couldn't get the output of `INTEL_DEBUG=cs`, it returned an output once but I lost it due to terminal scrollback. No matter how many times I ran it again, it never dumped the actual CS shader info.

I was successful when using `INTEL_DEBUG=cs,do32`, the combination of options prints the expected output every time. It doesn't change when my program works or crashes, so I hope that's OK.

So with the forced SIMD32 mode, codegen is still different and the issue remains. I'm attaching both outputs below (do32-failing.txt and do32-working.txt), here's the diff for generated native code:

 Native code for unnamed compute shader (null)
-SIMD32 shader: 496 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7936 to 6560 bytes (17%)
+SIMD32 shader: 498 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7968 to 6576 bytes (17%)
    START B0 (162 cycles)
 mov(8)          g4<1>UW         0x76543210V                     { align1 WE_all 1Q };
 mov(16)         g60<1>UD        g0.1<0,1,0>UD                   { align1 1H compacted };
@@ -4354,16 +4354,18 @@
 add(16)         g63<1>D         g8<8,8,1>D      g1.5<0,1,0>D    { align1 2H };
 add(16)         g3<1>UW         g4<16,16,1>UW   0x0010UW        { align1 WE_all 1H };
 mov(16)         g58<1>D         g4<8,8,1>UW                     { align1 1H };
-shl(16)         g68<1>D         g55<8,8,1>D     0x00000006UD    { align1 1H };
-shl(16)         g46<1>D         g63<8,8,1>D     0x00000006UD    { align1 2H };
+mul(16)         g68<1>D         g55<8,8,1>D     65D             { align1 1H compacted };
+mul(16)         g46<1>D         g63<8,8,1>D     65D             { align1 2H };
 shl(16)         g56<1>D         g2<0,1,0>D      0x00000005UD    { align1 1H };
 shl(16)         g64<1>D         g2<0,1,0>D      0x00000005UD    { align1 2H };
 mov(16)         g66<1>D         g3<8,8,1>UW                     { align1 2H };
 add(16)         g60<1>D         g58<8,8,1>D     g56<8,8,1>D     { align1 1H compacted };
-and(16)         g62<1>UD        g60<8,8,1>UD    0x0000003fUD    { align1 1H compacted };
+math intmod(8)  g62<1>UD        g60<8,8,1>UD    0x00000041UD    { align1 1Q compacted };
+math intmod(8)  g63<1>UD        g61<8,8,1>UD    0x00000041UD    { align1 2Q compacted };
 add.z.f0(16)    g76<1>D         g68<8,8,1>D     g62<8,8,1>D     { align1 1H compacted };
 add(16)         g68<1>D         g66<8,8,1>D     g64<8,8,1>D     { align1 2H };
-and(16)         g70<1>UD        g68<8,8,1>UD    0x0000003fUD    { align1 2H };
+math intmod(8)  g70<1>UD        g68<8,8,1>UD    0x00000041UD    { align1 3Q };
+math intmod(8)  g71<1>UD        g69<8,8,1>UD    0x00000041UD    { align1 4Q };
 add.z.f0(16)    g50<1>D         g46<8,8,1>D     g70<8,8,1>D     { align1 2H };
 (+f0) if(32)    JIP: 416        UIP: 416                        { align1 };
    END B0 ->B1 ->B2
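
The instruction changes in this diff look like ordinary strength reduction: with a workgroup size of 64 the compiler can use a shift by 6 and a mask of 0x3f, while 65 (0x41) is not a power of two and needs a real multiply and integer modulo. A minimal sketch of that arithmetic, for non-negative integers only; it illustrates the equivalence and is not a description of how the i965 compiler is implemented.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def scale_and_wrap(x: int, size: int):
    """Return (x * size, x % size) for non-negative x.

    Uses shift and mask when size is a power of two, which is the cheaper
    form seen in the 64-item shader (shl by 6, and with 0x3f).
    """
    if is_power_of_two(size):
        shift = size.bit_length() - 1  # 64 -> 6
        mask = size - 1                # 64 -> 0x3f
        return x << shift, x & mask
    # Non-power-of-two sizes (e.g. 65 == 0x41) need mul and integer modulo.
    return x * size, x % size


if __name__ == "__main__":
    print(scale_and_wrap(70, 64))  # prints (4480, 6)
    print(scale_and_wrap(70, 65))  # prints (4550, 5)
```

Both forms compute the same values, so on its own this difference would not be expected to cause a hang.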

Comment 16 Jakub Okoński 2019-01-09 21:51:01 UTC

Created attachment 143046 [details]
INTEL_DEBUG=cs,do32 output for the 65 local workgroup size case (GPU hang)

Comment 17 Jakub Okoński 2019-01-09 21:51:16 UTC

Created attachment 143047 [details]
INTEL_DEBUG=cs,do32 output for the 64 local workgroup size case (working)

Comment 18 Eero Tamminen 2019-01-10 11:53:46 UTC

(In reply to Jakub Okoński from comment #15)
> I couldn't get the output of `INTEL_DEBUG=cs`, it returned an output once
> but I lost due to terminal scrollback. No matter how many times I ran it
> again, it never dumped the actual CS shader info.

That's weird.

If shaders come from cache, they seem to be missing "SIMD" line, and some other info, but the actual assembly instructions should be there.

Do you get shader info, if you also disable cache:
  MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs <use-case>
?

Comment 19 Jakub Okoński 2019-01-10 15:35:21 UTC

It helps and I get reliable output, although I don't understand how the GLSL caching option is relevant. I use vulkan without VkPipelineCache, and yet it is being cached anyway?

I'm attaching new debug outputs, the diffs look pretty similar, but they now use SIMD16 in both cases, with one more instruction in the failing shader.

Comment 20 Jakub Okoński 2019-01-10 15:36:24 UTC

Created attachment 143060 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 65 items (failing)

Comment 21 Jakub Okoński 2019-01-10 15:36:38 UTC

Created attachment 143061 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 64 items (working)

Comment 22 Eero Tamminen 2019-01-10 17:42:45 UTC

(In reply to Jakub Okoński from comment #19)
> It helps and I get reliable output, although I don't understand how the GLSL
> caching option is relevant.

Shader compiler is shared between Vulkan and GL drivers.

> I use vulkan without VkPipelineCache, and yet it
> is being cached anyway?

Aren't pipeline objects higher level concept than shaders?  Shader caching underneath should be invisible to the upper layers (except for performance and debug output), and not affect the resulting shader binaries (unless it's buggy).


Lionel, do you think the shader assembly changes between working and non-working work group sizes have anything to do with the hangs?  There's no code-flow difference, and the few instruction changes (shift -> mul, and -> 2x intmod) look innocent to me.


-> seems that minimal reproduction code would help. :-)

Comment 23 Jakub Okoński 2019-01-10 18:25:10 UTC

I'm trying to reduce the shader to bare-minimum that reproduces the hang, here it is:

```
#version 450

layout (local_size_x = 65) in;

void main() {
   if (gl_GlobalInvocationID.x >= 20) {
       return;
   }

   barrier();
}
```

New piece of information is the early-return that is required to trigger the hang. From this baseline shader, if I decrease local_size_x to 64, it works. Or I can remove the if statement and it will also work. Or I can remove the barrier() and it will also start to work. It seems to be a combination of these that causes the hang. If I take away any piece from it, the problem goes away.

I'm attaching the debug info on this minimal shader.

Comment 24 Jakub Okoński 2019-01-10 18:25:38 UTC

Created attachment 143062 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs for the minimal repro shader

Comment 25 Jakub Okoński 2019-01-10 18:30:59 UTC

I spoke a bit too soon. For this minimal shader, the local_size_x seems irrelevant and decreasing it doesn't prevent the hang. I wonder if this is the same issue or if I uncovered some other issue by reducing the test case.

Comment 26 Eero Tamminen 2019-01-16 11:56:21 UTC

Lionel, is the info from Jakub enough to reproduce the issue?


FYI: Last night git builds of drm-tip kernel, X and Mesa didn't have hangs on SKL, but all BXT devices had hard hang in Aztec Ruins Vulkan test-case.

(Just git version of Mesa + v4.20 drm-tip kernel wasn't enough to trigger it.)

Comment 27 Eero Tamminen 2019-01-17 09:10:35 UTC

(In reply to Eero Tamminen from comment #26)
> FYI: Last night git builds of drm-tip kernel, X and Mesa didn't have hangs
> on SKL, but all BXT devices had hard hang in Aztec Ruins Vulkan test-case.

This was unrelated (kernel) bug which seems to have been introduced and fixed (day later) by Chris.

Comment 28 Eero Tamminen 2019-02-04 12:00:02 UTC

Jakub, are you still seeing the hangs with your test-cases?

I haven't seen them for a while with my test-cases when using latest Mesa (and drm-tip kernel 4.20 or 5.0-rc).

There have been a couple of CS related Mesa fixes since I filed this bug:
----------------------------------------------------------------
commit fea5b8e5ad5042725cb52d6d37256b9185115502
Author:     Oscar Blumberg <carnaval@12-10e.me>
AuthorDate: Sat Jan 26 16:47:42 2019 +0100
Commit:     Kenneth Graunke <kenneth@whitecape.org>
CommitDate: Fri Feb 1 10:53:33 2019 -0800

    intel/fs: Fix memory corruption when compiling a CS
    
    Missing check for shader stage in the fs_visitor would corrupt the
    cs_prog_data.push information and trigger crashes / corruption later
    when uploading the CS state.
...
commit 31e4c9ce400341df9b0136419b3b3c73b8c9eb7e
Author:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
AuthorDate: Thu Jan 3 16:18:48 2019 +0000
Commit:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
CommitDate: Fri Jan 4 11:18:54 2019 +0000

    i965: add CS stall on VF invalidation workaround
    
    Even with the previous commit, hangs are still happening. The problem
    there is that the VF cache invalidate do happen immediately without
    waiting for previous rendering to complete. What happens is that we
    invalidate the cache the moment the PIPE_CONTROL is parsed but we
    still have old rendering in the pipe which continues to pull data into
    the cache with the old high address bits. The later rendering with the
    new high address bits then doesn't have the clean cache that it
    expects/needs.
----------------------------------------------------------------

If you're still seeing hangs and they're 100% reproducible, I think it would be better to file a separate bug about it, and get it bisected.

Comment 29 Jakub Okoński 2019-02-04 16:20:07 UTC

Still hangs on 5.0-rc5, I will try compiling latest mesa from git to see if that helps.

Comment 30 Jakub Okoński 2019-02-04 18:25:04 UTC

I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you want me to create another issue?

Comment 31 Eero Tamminen 2019-02-05 11:10:36 UTC

Created attachment 143303 [details]
SKL GT3e CarChase GPU hang

Gah. Every time I comment that this seems to have gone, the very next day I get a new (recoverable) hang.  I.e. this happens nowadays *very* rarely.

This time there was one recoverable hang in GfxBench CarChase on SKL GT3e, no hangs on the other machines.  Like earlier, it doesn't have significant impact on performance.


(In reply to Jakub Okoński from comment #30)
> I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you
> want me to create another issue?

Lionel, any comments?


Jakub, you could attach i915 error state from:
  /sys/class/drm/card0/error

So that Mesa developers can check whether your hangs happen in the same place as mine.

Comment 32 Jakub Okoński 2019-02-05 13:02:05 UTC

Created attachment 143305 [details]
/sys/class/drm/card0/error Jakub

Here you go.

Comment 33 Eero Tamminen 2019-02-06 10:55:03 UTC

(In reply to Jakub Okoński from comment #25)
> I spoke a bit too soon. For this minimal shader, the local_size_x seems
> irrelevant and decreasing it doesn't prevent the hang. I wonder if this is
> the same issue or if I uncovered some other issue by reducing the test case.

(In reply to Jakub Okoński from comment #32)
> Created attachment 143305 [details]
> /sys/class/drm/card0/error Jakub
> 
> Here you go.

If there are still two different ways of reliably triggering the hang, could you attach error output also for the other and name them so that they can be differentiated?

(E.g. "minimal conditional return + barrier case hang" and "large / local_size_x case hang".)

Comment 34 Jakub Okoński 2019-02-07 16:03:06 UTC

I hope I'm not going crazy, but on 5.0-rc5 with mesa 19.0-rc2, the goal post has moved to 32 shader invocations in a local group. So comment #12 is outdated when it comes to the number. Otherwise, the behavior is the same, it's the combination of > 32 items AND conditional early return statements that cause the hang.

So in the end I only have one repro case I think, it's this:

#version 450

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();
}

From here, I can do any of:
1) comment out barrier() call
2) comment out the return statement (the if can stay)
3) decrease local_size_x to 32

And it will prevent the crash from happening. The drm/card0 error that I uploaded on February 5th is the only crash I can provide.

Comment 35 Jakub Okoński 2019-02-07 16:06:35 UTC

I meant to say: it's the combination of > 32 items AND conditional early return statements AND a barrier that cause the hang.

I also checked replacing the barrier() call with memory barriers, and it prevents the crash. So only execution barriers are a component/contributor to this issue.

Comment 36 Mark Janes 2019-02-28 17:26:23 UTC

Should this be blocking the Mesa 19.0 release?  Why wouldn't we suspect a bug in drm-tip instead of Mesa?

Comment 37 Eero Tamminen 2019-03-01 12:30:33 UTC

(In reply to Mark Janes from comment #36)
> Should this be blocking the Mesa 19.0 release?

At least it started happening for us after the previous release.

Because of bug 108787 caused by Meson, I started to wonder whether Meson is another possible cause for this (I started to see the random hangs sometime after switching to Meson).

Jakub, are you building Mesa with Meson (like me) or Autotools?


> Why wouldn't we suspect a bug in drm-tip instead of Mesa?

Somebody needs to:

* bisect Jakub's 2 fully reproducible compute shader test-cases to find out whether they're the same issue or not, and whether they are Mesa or kernel issues

* look into attached i915 error files to check whether Jakub's fully reproducible compute hangs, and my very rarely happening CarChase hangs have the same cause.  I.e. should there be separate bugs for these

* Ask for new bugs to be filed where applicable and move them to drm-tip, if they're kernel issues

(If those rare CarChase hangs match neither of Jakub's reproducible compute hang cases, and i915 error file isn't enough to locate the problem, that particular issue can't be located / debugged and probably needs to be wontfix, it's nowadays so rare.)

Comment 38 Jakub Okoński 2019-03-07 16:52:20 UTC

I was using meson to build the release candidates of mesa 19.0. Using this script to be exact: https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa#n39

I don't have much time available, but I can try bisecting. Need to come up with some scripts to build these packages on my workstation and not the mobile dual core SKU.

Should I be just bisecting the kernel with latest rc of mesa 19, just mesa, or both at the same time somehow?

Comment 39 Eero Tamminen 2019-03-07 17:34:20 UTC

(In reply to Jakub Okoński from comment #38)
> I was using meson to build the release candidates of mesa 19.0.

Found out that Meson isn't related (it was just enabling asserts in bug 108787).


> I don't have much time available, but I can try bisecting. Need to come up
> with some scripts to build these packages on my workstation and not the
> mobile dual core SKU.
> 
> Should I be just bisecting the kernel with latest rc of mesa 19, just mesa,
> or both at the same time somehow?

First find out which one is the culprit.  For this it should be enough to check some release versions of both, that go far enough back, whatever you can test most easily (e.g. readily available distro packages).  Test new Mesa with old kernel, new kernel with old Mesa, to verify that issue hasn't just moved.

Only after you've found a version(s) that don't have the problem, you can start bisecting.

You might first try bisecting things closer using release versions & pre-built packages if such are available, to minimize building needed for real git bisect.
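
The narrowing step described above, done over an ordered list of releases before any real git bisect, amounts to a binary search. A minimal sketch, assuming a fully reproducible test, a known-good oldest version and a single good-to-bad transition; `first_bad` and the version labels in the test are hypothetical, and at this point in the thread no good version had been found yet.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest list for the first bad version.

    Assumptions: versions[0] is good, versions[-1] is bad, and there is
    exactly one good -> bad transition in between.
    """
    lo, hi = 0, len(versions) - 1
    if is_bad(versions[lo]) or not is_bad(versions[hi]):
        raise ValueError("need a known-good oldest and a known-bad newest version")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the transition is at or before mid
        else:
            lo = mid  # the transition is after mid
    return versions[hi]
```

In practice `is_bad` would install the given package, run the reproducer and report whether the GPU hung; the git bisect then only has to cover the range between the last good and first bad release.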

Comment 40 Jakub Okoński 2019-03-07 18:51:56 UTC

I have a local cache of old packages I used to have installed years ago. I tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2 and they all have the same problem as 5.0.

I also have old packages of mesa, down to 11.0.7, but I'm using a rolling release distro and it would take a lot of effort (and probably breaking the system) to downgrade this far back. I was unable to build even 18.x versions locally due to some incompatibilities with llvm.

Maybe I'll try historical livecd versions of Ubuntu to check other Mesa versions. Is that a bad approach?

Comment 41 Eero Tamminen 2019-03-08 10:54:04 UTC

(In reply to Jakub Okoński from comment #40)
> I have a local cache of old packages I used to have installed years ago. I
> tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2
> and they all have the same problem as 5.0.
> 
> I also have old packages of mesa, down to 11.0.7, but I'm using a rolling
> release distro and it would take a lot of effort (and probably breaking the
> system) to downgrade this far back. I was unable to build even 18.x versions
> locally due to some incompatibilities with llvm.

i965 doesn't need/use LLVM.  Just disable gallium & RADV and everything LLVM related from your Mesa build.  Using autotools:
--with-dri-drivers=i965 --with-vulkan-drivers= --with-gallium-drivers= --disable-llvm


> Maybe I'll try historical livecd versions of Ubuntu to check other Mesa
> versions. Is that a bad approach?

Comment 42 Eero Tamminen 2019-03-11 12:36:15 UTC

(In reply to Jakub Okoński from comment #40)
> I have a local cache of old packages I used to have installed years ago. I
> tried a couple kernels down to 4.8.7 from November 2016 with mesa 19.0-rc2
> and they all have the same problem as 5.0.

Ok, so with a reproducible test-case it didn't even require new kernel => updated summary.

Comment 43 Eero Tamminen 2019-03-18 11:21:59 UTC

Compute hangs are no longer reproducible with my test-cases, but recently I've very rarely seen a (system) hang on BXT in GfxBench Manhattan 3.1, which uses compute.

These happen only with Wayland version under Weston, not with X version (under X, or Weston), so they're unlikely to be compute related though.

=> Jakub's fully reproducible test-cases are best for checking this.

Jakub, were you able to narrow down in which Mesa version your hangs happened / is it a 19.x regression?

Comment 44 Jakub Okoński 2019-03-18 22:28:44 UTC

Not yet, I need to find more time to do these rebuilds and bisect. I think I need to create a standalone, vulkan 1.0 test case for this, it's hard to do it in a bigger app.

Can I use the Conformance Test Suite to do this easily? I don't mean contributing to upstream CTS, just spinning off a test case with my problem.

Comment 45 Eero Tamminen 2019-03-20 13:04:38 UTC

Created attachment 143739 [details]
SKL GT3e CarChase GPU hang (mesa: b3aa37046b)

(In reply to Eero Tamminen from comment #43)
> Compute hangs aren't anymore reproducible with my test-cases

Added GPU hang tracking so that I can catch these.  Attached one with yesterday's Mesa Git.

Comment 46 Eero Tamminen 2019-04-09 12:34:47 UTC

(In reply to Jakub Okoński from comment #44)
> Not yet, I need to find more time to do these rebuilds and bisect. I think I
> need to create a standalone, vulkan 1.0 test case for this, it's hard to do
> it in a bigger app.
>
> Can I use the Conformance Test Suite to do this easily?

You might try also Piglit as it seems nowadays to have some support for Vulkan.  On quick browse I didn't see any test for compute with Vulkan though.

>  I don't mean contributing to upstream CTS, just spinning off a test case with my problem.

AFAIK Mesa CI runs both CTS and piglit, so getting the resulting test to upstream version of either piglit or CTS would be good.


(In reply to Eero Tamminen from comment #41)
> i965 doesn't need/use LLVM.  Just disable gallium & RADV and everything LLVM
> related from your Mesa build.  Using autotools:
> --with-dri-drivers=i965 --with-vulkan-drivers= --with-gallium-drivers=
> --disable-llvm

Sorry, I of course meant: "--with-vulkan-drivers=intel".


(In reply to Eero Tamminen from comment #45)
> Added GPU hang tracking so that I can catch these.

Every few days there's recoverable GPU hang on some SKL or BXT device in GfxBench Manhattan 3.1, CarChase or AztecRuins.

Comment 47 Jakub Okoński 2019-04-14 17:44:31 UTC

Finally made some progress here. I have created piglit test cases to demonstrate the problem. I still haven't done any bisecting, so I don't know if it's a regression.

Test #1: passes on my RADV desktop machine, fails on my Gen 9 6500U laptop and freezes the graphics for a couple seconds:

[require]

[compute shader]
#version 450

layout(binding = 0) buffer block {
    uint value[];
};

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();

    value[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
}

[test]
# 60 elements
ssbo 0 240
compute 5 1 1
probe ssbo uint 0 0 == 0
probe ssbo uint 0 16 == 4
probe ssbo uint 0 76 == 19
probe ssbo uint 0 128 == 0
probe ssbo uint 0 132 == 0


I have more variations of this, I could send a patch to piglit if you think it's valuable. Can you try to reproduce this exact case on your hardware?
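
For anyone checking the probe lines: they use byte offsets into a buffer of 4-byte uints, so offset 16 is element 4, offset 76 is element 19, and offsets 128 and 132 are elements 32 and 33. The sketch below recomputes what the test expects. It assumes the SSBO starts zeroed (which the `== 0` probes imply) and models only the intended result of the shader, not what any driver does.

```python
LOCAL_SIZE_X = 33   # layout (local_size_x = 33)
NUM_GROUPS = 5      # compute 5 1 1
CUTOFF = 20         # invocations with ID >= 20 return early
SSBO_BYTES = 240    # ssbo 0 240
UINT_BYTES = 4


def expected_ssbo():
    """Buffer contents the test expects, assuming it starts out zeroed."""
    values = [0] * (SSBO_BYTES // UINT_BYTES)  # 60 elements
    for gid in range(LOCAL_SIZE_X * NUM_GROUPS):
        if gid >= CUTOFF:
            continue  # early return: nothing is written
        values[gid] = gid
    return values


def probe(values, byte_offset):
    """Read the uint stored at a byte offset, like 'probe ssbo uint 0 <offset>'."""
    return values[byte_offset // UINT_BYTES]


if __name__ == "__main__":
    buf = expected_ssbo()
    for offset in (0, 16, 76, 128, 132):
        print(offset, probe(buf, offset))
```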

Comment 48 Jakub Okoński 2019-04-14 17:51:56 UTC

I should have mentioned, the == 4 and == 19 assertions are failing for me, it acts like none of the SIMD lanes executed anything AFAICT.

Comment 49 Jakub Okoński 2019-04-14 18:01:55 UTC

I double checked the SPIR-V specification, and I think this shader is invalid.

> 3.32.20 OpControlBarrier
> This instruction is only guaranteed to work correctly if placed strictly
> within uniform control flow within Execution. This ensures that if any
> invocation executes it, all invocations will execute it. 
> If placed elsewhere, an invocation may stall indefinitely.

I guess RADV and/or AMD hardware can handle this case? Or maybe it's compiled differently?
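
To make the "uniform control flow" point concrete for the test above: with `local_size_x = 33`, 5 workgroups and the `>= 20` early return, only part of workgroup 0 ever reaches `barrier()`. A small sketch that counts this per workgroup; the function names are made up, and it models only the shader's control flow, not the hardware.

```python
def barrier_arrivals(local_size_x, num_groups, cutoff):
    """Per workgroup, count the invocations that get past the early return
    (global ID < cutoff) and therefore reach the barrier."""
    counts = []
    for group in range(num_groups):
        first_id = group * local_size_x
        ids = range(first_id, first_id + local_size_x)
        counts.append(sum(1 for gid in ids if gid < cutoff))
    return counts


def is_uniform(arrivals, local_size_x):
    """The barrier is reached uniformly if all or none of the group get there."""
    return arrivals in (0, local_size_x)


if __name__ == "__main__":
    counts = barrier_arrivals(33, 5, 20)
    print(counts)                                # prints [20, 0, 0, 0, 0]
    print([is_uniform(c, 33) for c in counts])   # prints [False, True, True, True, True]
```

Note that the same count is non-uniform for `local_size_x = 32` as well, which did not hang, so this only shows that the shader is outside what the spec guarantees; it does not explain the 32/33 threshold.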

Comment 50 Eero Tamminen 2019-08-12 14:04:33 UTC

(In reply to Eero Tamminen from comment #46)
> (In reply to Eero Tamminen from comment #45)
> > Added GPU hang tracking so that I can catch these.
> 
> Every few days there's recoverable GPU hang on some SKL or BXT device in
> GfxBench Manhattan 3.1, CarChase or AztecRuins.

Last recoverable GfxBench i965 hangs were months ago, with older (5.0 or earlier) kernels.

I've also seen Heaven hang twice on SKL in June, but not since then.


-> Marking this as WORKSFORME (as I don't know what was fixed).


(In reply to Jakub Okoński from comment #49)
> I double checked the SPIR-V specification, and I think this shader is
> invalid.

If you think there's a valid issue after all with your compute shaders, could you file a separate issue about that?


> > 3.32.20 OpControlBarrier
> > This instruction is only guaranteed to work correctly if placed strictly
> > within uniform control flow within Execution. This ensures that if any
> > invocation executes it, all invocations will execute it. 
> > If placed elsewhere, an invocation may stall indefinitely.
> 
> I guess RADV and/or AMD hardware can handle this case? Or maybe it's
> compiled differently?
