Bug 108820 - [SKL] (recoverable) GPU hangs in benchmarks using compute shaders with drm-tip v4.19+ kernels
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks: mesa-19.0
Reported: 2018-11-21 11:34 UTC by Eero Tamminen
Modified: 2019-02-12 11:44 UTC
CC List: 1 user

See Also:
i915 platform:
i915 features:


Attachments
CarChase GPU hang (119.46 KB, text/plain) - 2019-01-03 10:52 UTC, Eero Tamminen
INTEL_DEBUG=cs,do32 output for the 65 local workgroup size case (GPU hang) (202.14 KB, text/plain) - 2019-01-09 21:51 UTC, Jakub Okoński
INTEL_DEBUG=cs,do32 output for the 64 local workgroup size case (working) (201.50 KB, text/plain) - 2019-01-09 21:51 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 65 items (failing) (149.49 KB, text/plain) - 2019-01-10 15:36 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 64 items (working) (149.03 KB, text/plain) - 2019-01-10 15:36 UTC, Jakub Okoński
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs for the minimal repro shader (8.48 KB, text/plain) - 2019-01-10 18:25 UTC, Jakub Okoński
SKL GT3e CarChase GPU hang (119.65 KB, text/plain) - 2019-02-05 11:10 UTC, Eero Tamminen
/sys/class/drm/card0/error Jakub (48.00 KB, text/plain) - 2019-02-05 13:02 UTC, Jakub Okoński

Description Eero Tamminen 2018-11-21 11:34:02 UTC
Setup:
* SKL GT2 / GT3e
* Ubuntu 18.04
* *drm-tip* v4.19 kernel
* Mesa & X git head

Test-case:
* Run a test-case using compute shaders

Expected output:
* No GPU hangs (like with earlier Mesa commits)

Actual output:
* Recoverable GPU hangs in test-cases that use compute shaders:
  - GfxBench Aztec Ruins, CarChase and Manhattan 3.1
  - Sascha Willems' Vulkan compute demos
  - SynMark CSDof / CSCloth
* Vulkan compute demos fail to run (other tests run successfully despite hangs)

This seems to be SKL-specific; it's not visible on other HW.

This regression happened between the following Mesa commits:
* dca35c598d: 2018-11-19 15:57:41: intel/fs,vec4: Fix a compiler warning
* a999798daa: 2018-11-20 17:09:22: meson: Add tests to suites

It also seems to be specific to the *drm-tip* v4.19.0 kernel, as I don't see it with the latest drm-tip v4.20.0-rc3 kernel.  So it's also possible that this is an i915 bug that just gets triggered by the Mesa change and was fixed later.


Sascha Willems' Vulkan Raytracing demo outputs the following on the first run:
---------------------------------
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    14920 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    10300 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 269 vs 256
    10944 bytes into the SPIR-V binary
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:1897
    Source and destination types of SpvOpStore do not have the same ID (but are compatible): 225 vs 212
    11920 bytes into the SPIR-V binary
INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
vulkan_raytracing: base/vulkanexamplebase.cpp:651: void VulkanExampleBase::submitFrame(): Assertion `res == VK_SUCCESS' failed.
-----------------------------

(Other runs show just the error and assert.)
Comment 1 Mark Janes 2018-11-21 18:56:22 UTC
Since this bug is limited to a drm-tip kernel, it seems likely that the problem is in the kernel, not in mesa.  Can you reproduce it on any released kernel?
Comment 2 Eero Tamminen 2018-11-28 14:41:51 UTC
(In reply to Eero Tamminen from comment #0)
> It also seems to be specific to the *drm-tip* v4.19.0 kernel, as I don't see
> it with the latest drm-tip v4.20.0-rc3 kernel.  So it's also possible that
> this is an i915 bug that just gets triggered by the Mesa change and was
> fixed later.

I've now seen hangs also with drm-tip v4.20.0-rc3 kernel.


However, these GPU hangs don't happen anymore with this or later Mesa commits
(regardless of whether they're with the v4.19 or v4.20-rc4 drm-tip kernels):
3c96a1e3a97ba 2018-11-26 08-29-39: radv: Fix opaque metadata descriptor last layer

-> FIXED?

(I'm lacking data for several previous days, so I can't give an exact time when those hangs stopped.)

The Raytracing demo's SPIR-V warnings still happen, although I updated Sascha Willems' demos to the latest Git version.
Comment 3 Eero Tamminen 2018-11-30 16:47:18 UTC
Sorry, all the hangs have happened with drm-tip v4.20-rc versions, not v4.19.

Last night there were again recoverable hangs on SKL, with drm-tip v4.20-rc4:
* GfxBench v5-GOLD2 Aztec Ruins GL & Vulkan ("normal") versions
* Unigine Heaven v4.0
* SynMark v7 CSCloth

Heaven doesn't use compute shaders, so maybe the issue isn't compute related after all.
Comment 4 Jakub Okoński 2018-12-25 20:41:28 UTC
It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am able to reproduce it in my own vulkan app. If I don't dispatch any compute work, everything works fine. As soon as I submit a CB with dispatches, I get that same error:

INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
Comment 5 Jakub Okoński 2018-12-25 20:44:00 UTC
I should clarify: when I said "as soon as I submit", I mean the vkQueueSubmit call exits with a device-lost error. Before it returns, my desktop freezes for a couple of seconds (maybe I can move my mouse, but it doesn't render the new position while hung).
Comment 6 Lionel Landwerlin 2019-01-02 14:55:03 UTC
Could you attach the /sys/class/drm/card0/error file after you notice a hang?
Thanks!
Comment 7 Eero Tamminen 2019-01-03 10:52:23 UTC
Created attachment 142952 [details]
CarChase GPU hang

I'm now pretty sure it's a drm-tip kernel issue.  It went away after the v4.20-rc4 kernel version at the end of November, but still happens when using our last v4.19 drm-tip build (currently used in our Mesa tracking).

It seems to happen more frequently with SKL GT3e than GT2.  Attached is the error state for a GfxBench CarChase hang with Mesa (8c93ef5de98a9) from a couple of days ago.

I've now updated our Mesa tracking to use the v4.20 drm-tip build; I'll report next week whether that helped (as expected).
Comment 8 Lionel Landwerlin 2019-01-03 12:25:15 UTC
I know this is going to be painful, but it would be really good to have a bisect on what commit broke this...
Skimming through the logs, I couldn't find anything between drm-tip/4.18-rc7 and drm-tip/4.20-rc4 that indicates a hang of this kind on gen9.

A bit later (4th of December) this fix appeared that could have an impact:

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds
    
    We stopped re-applying the GT workarounds after engine reset since commit
    59b449d5c82a ("drm/i915: Split out functions for different kinds of
    workarounds").
    
    Issue with this is that some of the GT workarounds live in the MMIO space
    which gets lost during engine resets. So far the registers in 0x2xxx and
    0xbxxx address range have been identified to be affected.
    
    This losing of applied workarounds has obvious negative effects and can
    even lead to hard system hangs (see the linked Bugzilla).
    
    Rather than just restoring this re-application, because we have also
    observed that it is not safe to just re-write all GT workarounds after
    engine resets (GPU might be live and weird hardware states can happen),
    we introduce a new class of per-engine workarounds and move only the
    affected GT workarounds over.
    
    Using the framework introduced in the previous patch, we therefore after
    engine reset, re-apply only the workarounds living in the affected MMIO
    address ranges.
    
    v2:
     * Move Wa_1406609255:icl to engine workarounds as well.
     * Rename API. (Chris Wilson)
     * Drop redundant IS_KABYLAKE. (Chris Wilson)
     * Re-order engine wa/ init so latest platforms are first. (Rodrigo Vivi)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugzilla.freedesktop.org/show_bug.cgi?id=107945
    Fixes: 59b449d5c82a ("drm/i915: Split out functions for different kinds of workarounds")
Comment 9 Eero Tamminen 2019-01-07 10:03:30 UTC
Still seeing the hangs with latest Mesa and drm-tip 4.20 kernel, on SKL GT3e & GT4e.

It happens on approximately 1 out of 3 runs.

Seems to happen only with:
* Aztec Ruins (normal, FullHD resolution)
* CarChase, but only when it's run in 4K resolution (not in FullHD):
  testfw_app --gfx glfw --gl_api desktop_core --width 3840 --height 2160 --fullscreen 0 --test_id gl_4
Comment 10 Eero Tamminen 2019-01-08 09:20:23 UTC
(In reply to Jakub Okoński from comment #4)
> It also affects me on Skylake i7-6500U, Mesa 18.3.1 and kernel 4.19.12. I am
> able to reproduce it in my own vulkan app. If I don't dispatch any compute
> work, everything works fine. As soon as I submit a CB with dispatches, I get
> that same error:
> 
> INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_device.c:2091: GPU
> hung on one of our command buffers (VK_ERROR_DEVICE_LOST)

Is this fully reproducible?  If yes, could you either attach your test-case, or (preferably :)) try bisecting it from Mesa?

As can be seen from the comments above, this isn't reproducible enough in my test-cases for me to reliably bisect it (or even determine whether it's a Mesa or a kernel issue).
Comment 11 Jakub Okoński 2019-01-08 20:33:50 UTC
I don't know about those other applications; I've been experiencing this issue in my own Vulkan app. I had time today to mess around a bit more. First I removed my dispatches and it worked fine, then I brought them back and started simplifying my compute shader.

So far, I've been able to isolate the issue to a single `barrier()` GLSL call near the end of my shader. I have another barrier earlier, `memoryBarrierShared()`, and it doesn't cause any issues. Perhaps this is specific to control flow barriers in compute shaders?
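
For reference, the difference between the two barrier kinds can be sketched in a minimal compute shader (illustrative only, not the shader from this report; the shared array and indices are made up):

```
#version 450
// Illustrative sketch: contrasts the two barrier kinds discussed above.
layout (local_size_x = 64) in;

shared uint scratch[64];  // hypothetical shared storage

void main() {
    uint idx = gl_LocalInvocationID.x;
    scratch[idx] = gl_GlobalInvocationID.x;

    // Memory barrier: orders this invocation's writes to shared memory,
    // but by itself does not synchronize execution across the workgroup.
    memoryBarrierShared();

    // Control flow (execution) barrier: every invocation in the workgroup
    // must reach this point before any is allowed to continue past it.
    barrier();

    // After the memory barrier + execution barrier pair above, reading a
    // neighbouring invocation's slot is well defined.
    uint neighbour = scratch[(idx + 1u) % 64u];
}
```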

I am preparing my code to serve as a repro case; I should have it soon, but I use a Rust toolchain, so it might not be the easiest.
Comment 12 Jakub Okoński 2019-01-08 20:40:53 UTC
OK, I think I found the precise issue. It occurs when using a control flow barrier in the shader with more than 64 items in the workgroup.

To put it in concrete terms:

Shader A:
...
layout (local_size_x = 64) in;
void main() {
    // code
    barrier();
}

----
Works fine.

Shader B:
...
layout (local_size_x = 65) in;
void main() {
    // code
    barrier();
}

----
Hangs with INTEL-MESA: error: src/intel/vulkan/anv_device.c:2091: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST).


Shader C:
...
layout (local_size_x = 65) in;
void main() {
    // code
    // barrier(); without any control flow barriers inside
}
----
Works fine as well.

This should be enough to zero in on the issue, but if you need code you can run to reproduce it locally, let me know and I can provide it.
Comment 13 Jakub Okoński 2019-01-08 20:50:00 UTC
The Vulkan spec requires a minimum of 128 items in the first dimension of a workgroup, and the driver reports maxComputeWorkGroupSize[0] = 896, so I think my application is well behaved in this case and should not be hanging because of limits.
Comment 14 Eero Tamminen 2019-01-09 12:24:12 UTC
(In reply to Jakub Okoński from comment #12)
> OK, I think I found the precise issue. It occurs when using a control flow
> barrier in the shader with more than 64 items in the workgroup.

Great, thanks!


(In reply to Jakub Okoński from comment #13)
> The Vulkan spec requires a minimum of 128 items in the first dimension of a
> workgroup, and the driver reports maxComputeWorkGroupSize[0] = 896, so I
> think my application is well behaved in this case and should not be hanging
> because of limits.

At certain workgroup size thresholds, at least the SIMD mode that gets used can increase (the threshold differs between platforms).

You can check whether there's a change in that between working and non-working cases with something like this:
  INTEL_DEBUG=cs <your test-case>  2> shader-ir-asm.txt
  grep ^SIMD shader-ir-asm.txt

If SIMD mode doesn't change, what's the diff between the shader IR/ASM output of the two versions?
Comment 15 Jakub Okoński 2019-01-09 21:50:13 UTC
I couldn't get the output of `INTEL_DEBUG=cs`; it returned output once, but I lost it due to terminal scrollback. No matter how many times I ran it again, it never dumped the actual CS shader info.

I was successful when using `INTEL_DEBUG=cs,do32`; that combination of options prints the expected output every time. It doesn't change whether my program works or crashes, so I hope that's OK.

So with the forced SIMD32 mode, codegen is still different and the issue remains. I'm attaching both outputs below (do32-failing.txt and do32-working.txt); here's the diff of the generated native code:

 Native code for unnamed compute shader (null)
-SIMD32 shader: 496 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7936 to 6560 bytes (17%)
+SIMD32 shader: 498 instructions. 0 loops. 19586 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 7968 to 6576 bytes (17%)
    START B0 (162 cycles)
 mov(8)          g4<1>UW         0x76543210V                     { align1 WE_all 1Q };
 mov(16)         g60<1>UD        g0.1<0,1,0>UD                   { align1 1H compacted };
@@ -4354,16 +4354,18 @@
 add(16)         g63<1>D         g8<8,8,1>D      g1.5<0,1,0>D    { align1 2H };
 add(16)         g3<1>UW         g4<16,16,1>UW   0x0010UW        { align1 WE_all 1H };
 mov(16)         g58<1>D         g4<8,8,1>UW                     { align1 1H };
-shl(16)         g68<1>D         g55<8,8,1>D     0x00000006UD    { align1 1H };
-shl(16)         g46<1>D         g63<8,8,1>D     0x00000006UD    { align1 2H };
+mul(16)         g68<1>D         g55<8,8,1>D     65D             { align1 1H compacted };
+mul(16)         g46<1>D         g63<8,8,1>D     65D             { align1 2H };
 shl(16)         g56<1>D         g2<0,1,0>D      0x00000005UD    { align1 1H };
 shl(16)         g64<1>D         g2<0,1,0>D      0x00000005UD    { align1 2H };
 mov(16)         g66<1>D         g3<8,8,1>UW                     { align1 2H };
 add(16)         g60<1>D         g58<8,8,1>D     g56<8,8,1>D     { align1 1H compacted };
-and(16)         g62<1>UD        g60<8,8,1>UD    0x0000003fUD    { align1 1H compacted };
+math intmod(8)  g62<1>UD        g60<8,8,1>UD    0x00000041UD    { align1 1Q compacted };
+math intmod(8)  g63<1>UD        g61<8,8,1>UD    0x00000041UD    { align1 2Q compacted };
 add.z.f0(16)    g76<1>D         g68<8,8,1>D     g62<8,8,1>D     { align1 1H compacted };
 add(16)         g68<1>D         g66<8,8,1>D     g64<8,8,1>D     { align1 2H };
-and(16)         g70<1>UD        g68<8,8,1>UD    0x0000003fUD    { align1 2H };
+math intmod(8)  g70<1>UD        g68<8,8,1>UD    0x00000041UD    { align1 3Q };
+math intmod(8)  g71<1>UD        g69<8,8,1>UD    0x00000041UD    { align1 4Q };
 add.z.f0(16)    g50<1>D         g46<8,8,1>D     g70<8,8,1>D     { align1 2H };
 (+f0) if(32)    JIP: 416        UIP: 416                        { align1 };
    END B0 ->B1 ->B2
Comment 16 Jakub Okoński 2019-01-09 21:51:01 UTC
Created attachment 143046 [details]
INTEL_DEBUG=cs,do32 output for the 65 local workgroup size case (GPU hang)
Comment 17 Jakub Okoński 2019-01-09 21:51:16 UTC
Created attachment 143047 [details]
INTEL_DEBUG=cs,do32 output for the 64 local workgroup size case (working)
Comment 18 Eero Tamminen 2019-01-10 11:53:46 UTC
(In reply to Jakub Okoński from comment #15)
> I couldn't get the output of `INTEL_DEBUG=cs`; it returned output once, but
> I lost it due to terminal scrollback. No matter how many times I ran it
> again, it never dumped the actual CS shader info.

That's weird.

If shaders come from the cache, they seem to be missing the "SIMD" line and some other info, but the actual assembly instructions should be there.

Do you get shader info if you also disable the cache:
  MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs <use-case>
?
Comment 19 Jakub Okoński 2019-01-10 15:35:21 UTC
It helps and I get reliable output, although I don't understand how the GLSL caching option is relevant. I use vulkan without VkPipelineCache, and yet it is being cached anyway?

I'm attaching new debug outputs, the diffs look pretty similar, but they now use SIMD16 in both cases, with one more instruction in the failing shader.
Comment 20 Jakub Okoński 2019-01-10 15:36:24 UTC
Created attachment 143060 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 65 items (failing)
Comment 21 Jakub Okoński 2019-01-10 15:36:38 UTC
Created attachment 143061 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs with 64 items (working)
Comment 22 Eero Tamminen 2019-01-10 17:42:45 UTC
(In reply to Jakub Okoński from comment #19)
> It helps and I get reliable output, although I don't understand how the GLSL
> caching option is relevant.

The shader compiler is shared between the Vulkan and GL drivers.

> I use vulkan without VkPipelineCache, and yet it
> is being cached anyway?

Aren't pipeline objects a higher-level concept than shaders?  Shader caching underneath should be invisible to the upper layers (except for performance and debug output) and not affect the resulting shader binaries (unless it's buggy).


Lionel, do you think the shader assembly changes between the working and non-working workgroup sizes have anything to do with the hangs?  There's no control-flow difference, and the few instruction changes (shift -> mul, and -> 2x intmod) look innocent to me.


-> It seems that minimal reproduction code would help. :-)
Comment 23 Jakub Okoński 2019-01-10 18:25:10 UTC
I've been trying to reduce the shader to the bare minimum that reproduces the hang; here it is:

```
#version 450

layout (local_size_x = 65) in;

void main() {
   if (gl_GlobalInvocationID.x >= 20) {
       return;
   }

   barrier();
}
```

The new piece of information is the early return, which is required to trigger the hang. From this baseline shader, if I decrease local_size_x to 64, it works. Or I can remove the if statement and it will also work. Or I can remove the barrier() and it will also start working. It seems to be the combination of these that causes the hang; if I take away any piece of it, the problem goes away.

I'm attaching the debug info on this minimal shader.
Comment 24 Jakub Okoński 2019-01-10 18:25:38 UTC
Created attachment 143062 [details]
MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=cs for the minimal repro shader
Comment 25 Jakub Okoński 2019-01-10 18:30:59 UTC
I spoke a bit too soon. For this minimal shader, the local_size_x seems irrelevant and decreasing it doesn't prevent the hang. I wonder if this is the same issue or if I uncovered some other issue by reducing the test case.
Comment 26 Eero Tamminen 2019-01-16 11:56:21 UTC
Lionel, is the info from Jakub enough to reproduce the issue?


FYI: Last night's git builds of the drm-tip kernel, X and Mesa didn't have hangs on SKL, but all BXT devices had a hard hang in the Aztec Ruins Vulkan test-case.

(Just git version of Mesa + v4.20 drm-tip kernel wasn't enough to trigger it.)
Comment 27 Eero Tamminen 2019-01-17 09:10:35 UTC
(In reply to Eero Tamminen from comment #26)
> FYI: Last night's git builds of the drm-tip kernel, X and Mesa didn't have
> hangs on SKL, but all BXT devices had a hard hang in the Aztec Ruins Vulkan
> test-case.

This was an unrelated (kernel) bug which seems to have been introduced and fixed (a day later) by Chris.
Comment 28 Eero Tamminen 2019-02-04 12:00:02 UTC
Jakub, are you still seeing the hangs with your test-cases?

I haven't seen them for a while with my test-cases when using the latest Mesa (and drm-tip kernel 4.20 or 5.0-rc).

There have been a couple of CS-related Mesa fixes since I filed this bug:
----------------------------------------------------------------
commit fea5b8e5ad5042725cb52d6d37256b9185115502
Author:     Oscar Blumberg <carnaval@12-10e.me>
AuthorDate: Sat Jan 26 16:47:42 2019 +0100
Commit:     Kenneth Graunke <kenneth@whitecape.org>
CommitDate: Fri Feb 1 10:53:33 2019 -0800

    intel/fs: Fix memory corruption when compiling a CS
    
    Missing check for shader stage in the fs_visitor would corrupt the
    cs_prog_data.push information and trigger crashes / corruption later
    when uploading the CS state.
...
commit 31e4c9ce400341df9b0136419b3b3c73b8c9eb7e
Author:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
AuthorDate: Thu Jan 3 16:18:48 2019 +0000
Commit:     Lionel Landwerlin <lionel.g.landwerlin@intel.com>
CommitDate: Fri Jan 4 11:18:54 2019 +0000

    i965: add CS stall on VF invalidation workaround
    
    Even with the previous commit, hangs are still happening. The problem
    there is that the VF cache invalidate do happen immediately without
    waiting for previous rendering to complete. What happens is that we
    invalidate the cache the moment the PIPE_CONTROL is parsed but we
    still have old rendering in the pipe which continues to pull data into
    the cache with the old high address bits. The later rendering with the
    new high address bits then doesn't have the clean cache that it
    expects/needs.
----------------------------------------------------------------

If you're still seeing hangs and they're 100% reproducible, I think it would be better to file a separate bug about it, and get it bisected.
Comment 29 Jakub Okoński 2019-02-04 16:20:07 UTC
It still hangs on 5.0-rc5; I will try compiling the latest Mesa from git to see if that helps.
Comment 30 Jakub Okoński 2019-02-04 18:25:04 UTC
I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you want me to create another issue?
Comment 31 Eero Tamminen 2019-02-05 11:10:36 UTC
Created attachment 143303 [details]
SKL GT3e CarChase GPU hang

Gah. Every time I comment that this seems to have gone away, the very next day I get a new (recoverable) hang.  I.e. this happens nowadays *very* rarely.

This time there was one recoverable hang in GfxBench CarChase on SKL GT3e, and no hangs on the other machines.  Like earlier, it doesn't have a significant impact on performance.


(In reply to Jakub Okoński from comment #30)
> I tried on mesa 19.1.0 git revision 64d3b148fe7 and it also hangs, do you
> want me to create another issue?

Lionel, any comments?


Jakub, you could attach the i915 error state from:
  /sys/class/drm/card0/error

so that Mesa developers can check whether your hangs happen in the same place as mine.
Comment 32 Jakub Okoński 2019-02-05 13:02:05 UTC
Created attachment 143305 [details]
/sys/class/drm/card0/error Jakub

Here you go.
Comment 33 Eero Tamminen 2019-02-06 10:55:03 UTC
(In reply to Jakub Okoński from comment #25)
> I spoke a bit too soon. For this minimal shader, the local_size_x seems
> irrelevant and decreasing it doesn't prevent the hang. I wonder if this is
> the same issue or if I uncovered some other issue by reducing the test case.

(In reply to Jakub Okoński from comment #32)
> Created attachment 143305 [details]
> /sys/class/drm/card0/error Jakub
> 
> Here you go.

If there are still two different ways of reliably triggering the hang, could you attach the error output for the other one as well, and name them so that they can be differentiated?

(E.g. "minimal conditional return + barrier case hang" and "large / local_size_x case hang".)
Comment 34 Jakub Okoński 2019-02-07 16:03:06 UTC
I hope I'm not going crazy, but on 5.0-rc5 with Mesa 19.0-rc2, the goalpost has moved to 32 shader invocations in a local group. So comment #12 is outdated when it comes to the number. Otherwise the behavior is the same: it's the combination of > 32 items AND conditional early return statements that causes the hang.

So in the end I think I only have one repro case; it's this:

#version 450

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    barrier();
}

From here, I can do any of:
1) comment out barrier() call
2) comment out the return statement (the if can stay)
3) decrease local_size_x to 32

And it will prevent the crash from happening. The drm/card0 error that I uploaded on February 5th is the only crash I can provide.
Comment 35 Jakub Okoński 2019-02-07 16:06:35 UTC
I meant to say: it's the combination of > 32 items AND conditional early return statements AND a barrier that causes the hang.

I also checked replacing the barrier() call with memory barriers, and that prevents the crash. So only execution barriers contribute to this issue.
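
For illustration, a sketch of that workaround applied to the minimal repro shader above (assuming memoryBarrierShared() as the substituted memory barrier; per this comment, such a variant does not hang):

```
#version 450

layout (local_size_x = 33) in;

void main() {
    if (gl_GlobalInvocationID.x >= 20) {
        return;
    }

    // Memory-only barrier instead of the execution barrier; per the comment
    // above, this variant does not trigger the hang.
    memoryBarrierShared();
}
```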

