Bug 105755

Summary: Mesa freezes when the GLSL shader contains a `for` loop with an uninitialized `i` index/counter variable
Product: Mesa              Reporter: Swyter <swyterzone+mesa3d>
Component: Mesa core       Assignee: mesa-dev
Status: RESOLVED MOVED     QA Contact: mesa-dev
Severity: normal
Priority: medium           CC: swyterzone+mesa3d
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
i915 platform:             i915 features:
Attachments: default.frag
default.vert
default-fixed.frag
glxinfo.txt
game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt
sorry, wrong attachment
sorry, wrong attachment

Description Swyter 2018-03-26 23:17:22 UTC
Created attachment 138364 [details]
default.frag

So yeah, this has baffled me for a while. I've been getting reports of soft-locks and freezes of my OpenGL/SDL2 game on Mesa-based Linux distros, mainly on Intel hardware. The game works fine on Windows and macOS using the same shaders. Luckily I have just found out that it also happens under Nouveau, so the problem seems to be at the GLSL parser level or in some other shared/common component.

After debugging the soft-locking issue for a bit it was clear that it only happened whenever the dynamic point lights code path was in use. After disabling and re-enabling it I suddenly realized something:

> for (int i; i < lightCount; i++)

I forgot to initialize the loop counter, and that completely froze the entire graphics pipeline, making the system unusable!
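
The fix is trivial, of course: just initialize the counter to zero, which is essentially what the default-fixed.frag attachment below does:

> for (int i = 0; i < lightCount; i++)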


I find it funny that glslangValidator doesn't throw any error or warning even though this should cause undefined behavior. It looks like the proprietary AMD and NVIDIA OpenGL drivers initialize the `i` index variable to zero, which seems like the sane thing to do.


It was my mistake and I'll try to lint my shaders better, but the consequences are a bit unforgiving.


PS: I'd attach an apitrace, but everything freezes. Let me know if you need anything else; maybe we can give a bunch of Steam keys to Mesa developers, in case they are needed. The game is called Sphinx and the Cursed Mummy, a native OpenGL 3+ port.
Comment 1 Swyter 2018-03-26 23:22:30 UTC
Created attachment 138365 [details]
default.vert
Comment 2 Swyter 2018-03-26 23:24:45 UTC
Created attachment 138366 [details]
default-fixed.frag
Comment 3 Swyter 2018-03-26 23:27:32 UTC
Created attachment 138367 [details]
glxinfo.txt
Comment 4 Ilia Mirkin 2018-03-26 23:28:51 UTC
The shader compiler provides warnings:

$ bin/glslparsertest foo.frag pass 1.30
Successfully compiled fragment shader foo.frag: 0:54(14): warning: `i' used uninitialized
0:58(39): warning: `i' used uninitialized
0:59(53): warning: `i' used uninitialized
0:61(30): warning: `i' used uninitialized
0:62(33): warning: `i' used uninitialized
0:54(30): warning: `i' used uninitialized

PIGLIT: {"result": "pass" }

These warnings would be available to you programmatically using the KHR_debug callback, I believe. Or printed to stderr with a debug mesa build.
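
Roughly, the application-side setup would look something like this (just a sketch, assuming your GL loader/header exposes the GL 4.3 / KHR_debug entry points; the function names here are made up):

#include <stdio.h>
#include <GL/glcorearb.h>  /* or whichever loader/header provides the KHR_debug symbols */

/* Standard GLDEBUGPROC signature; note that compiler warnings come through
 * as DEBUG_TYPE_OTHER messages, not as errors, so don't filter these down
 * to GL_DEBUG_TYPE_ERROR only. */
static void APIENTRY debug_cb(GLenum source, GLenum type, GLuint id,
                              GLenum severity, GLsizei length,
                              const GLchar *message, const void *userParam)
{
    fprintf(stderr, "GL debug [src 0x%x type 0x%x sev 0x%x]: %s\n",
            source, type, severity, message);
}

static void install_debug_callback(void)
{
    glEnable(GL_DEBUG_OUTPUT);
    glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);
    glDebugMessageCallback(debug_cb, NULL);
    /* Don't filter anything out: all sources, all types, all severities. */
    glDebugMessageControl(GL_DONT_CARE, GL_DONT_CARE, GL_DONT_CARE,
                          0, NULL, GL_TRUE);
}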
Comment 5 Swyter 2018-03-27 01:08:37 UTC
Created attachment 138368 [details]
game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt

@imirkin Looks like these GLSL warnings aren't exposed through GL_ARB_debug_output/GL_KHR_debug callbacks. I originally developed the Linux port using Mesa/radeonsi on a [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d4) (PCI ID 1002:130f) and the warnings and error reporting always proved useful, but that was before I implemented the dynamic lights part. I need to test whether the newer version of the game (with the attached shader) locks up AMD/radeon cards, too.

Anyway, thanks for the piglit tip. I'll use it from now on to lint my shaders, hopefully they'll end up slightly less 'freezy'.
Comment 6 Ian Romanick 2018-03-27 01:21:37 UTC
Swyter: Can you run your program in gdb, interrupt it during the hang, and provide a backtrace?

Reading Ilia's response, it seems that the trivial glslparsertest-based test case doesn't reproduce the hang.  I guess a shader-runner test that tries to draw with the shader would...
Comment 7 Ilia Mirkin 2018-03-27 01:26:11 UTC
(In reply to Ian Romanick from comment #6)
> Swyter: Can you run your program in gdb, interrupt it during the hang, and
> provide a backtrace?
> 
> Reading Ilia's response, it seems that the trivial glslparsertest-based
> test case doesn't reproduce the hang.  I guess a shader-runner test that
> tries to draw with the shader would...

glslparsertest just parses. The hang is just due to a very long shader loop (sometimes, I'd guess, depending on what 'i' happens to get initialized to... if it ends up as something like INT_MIN, it'll be looping for a while). gdb would just show it hung in "waiting for draw to complete".
Comment 8 Ilia Mirkin 2018-03-27 02:01:59 UTC
(In reply to Swyter from comment #5)
> Created attachment 138368 [details]
> game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt
> 
> @imirkin Looks like these GLSL warnings aren't exposed through
> GL_ARB_debug_output/GL_KHR_debug callbacks. I originally developed the Linux

That's very surprising.

This stuff goes through _mesa_glsl_warning, which in turn should emit a DEBUG_TYPE_OTHER message. Are you sure that you don't get a callback with that? (Or perhaps you have it filtering down to just ERRORs?)
Comment 9 Swyter 2018-03-27 02:56:48 UTC
@imirkin I'm just doing this, nothing too fancy:

>	glEnable(GL_DEBUG_OUTPUT);
>	glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); GL_ASSERT;
>	glDebugMessageCallback(GLErrorCallback, NULL); GL_ASSERT;
>	glDebugMessageControl(
>		GL_DONT_CARE,
>		GL_DONT_CARE,
>		GL_DONT_CARE,
>		0,
>		NULL,
>		TRUE
>	); GL_ASSERT;

If you take a look at the game log above you can see that it spits out the compiler info stats. Maybe I'm doing something wrong.

Anyway, here is a backtrace from a customer using i965 with an Intel(R) HD Graphics 520. He is running a release build of Mesa, so it doesn't come with debug symbols, but you can see that it is stuck at a DRI syscall at the end of the frame, just like you said:

https://www.reddit.com/r/linux_gaming/comments/81t838/sphinx_and_the_cursed_mummy_freezes_at_intro/dvjsvop/


The only thing that appears in the stdout/err log from the driver side is this:

> i965: Failed to submit batchbuffer: Input/output error

(https://www.reddit.com/r/linux_gaming/comments/81t838/sphinx_and_the_cursed_mummy_freezes_at_intro/dvjhza6/)

Some of the other replies in that reddit thread may be of interest. I told him to use the LIBGL_DEBUG=verbose env var, but it doesn't look too chatty.
Comment 10 Ilia Mirkin 2018-03-27 03:14:09 UTC
(In reply to Swyter from comment #9)
> @imirkin I'm just doing this, nothing too fancy:
> 
> >	glEnable(GL_DEBUG_OUTPUT);
> >	glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); GL_ASSERT;
> >	glDebugMessageCallback(GLErrorCallback, NULL); GL_ASSERT;
> >	glDebugMessageControl(
> >		GL_DONT_CARE,
> >		GL_DONT_CARE,
> >		GL_DONT_CARE,
> >		0,
> >		NULL,
> >		TRUE
> >	); GL_ASSERT;
> 
> If you take a look at the game log above you can see that it spits the
> compiler info stats. Maybe I'm doing something wrong.

A couple of stupid questions:

1. Could you be feeding the shader to glCompileShader before the above lines?
2. Could the end user have already cached the compilation of this shader?

For the latter, one can run with MESA_GLSL_CACHE_DISABLE=1 to avoid using the cache.
Comment 11 Ilia Mirkin 2018-03-27 03:29:31 UTC
As an aside... there's no compilation bug here. Perhaps $other driver happens to get you a value of 0, but nothing guarantees that. It could just be luck in precisely how they do register allocation (RA). Certainly nothing in the spec requires it.

And there are legitimate situations where a variable might be uninitialized prior to (compiler-proven) use, e.g.

int x;
int y = 0;
loop {
  if (a)
    y += x;
  if (b)
    x = 5;
}

The compiler couldn't reasonably prove that a never happens before b does (except in some circumstances).

Would it be the end of the world if one were to add code to zero-initialize all variables? No - but it'd add unnecessary code to otherwise functioning shaders.

Mesa tends to stick to what's required by the spec.

(You could argue that the "shader has gone into infinite loop" case should be handled better, with some success, but that's obviously a very driver-backend-specific issue.)
Comment 12 Swyter 2018-03-27 11:40:38 UTC
Yeah, that was it: I only use a few shaders and they are all compiled at startup. Good to know that caching is implemented Mesa-wide now.

> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 28, inst: 134, bytes: 1432
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 7, inst: 41, bytes: 440
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 16, inst: 17, bytes: 184
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 6, inst: 14, bytes: 152
> *EE: [! OGL LOG ] 0:54(14): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:58(39): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:59(53): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:61(30): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:62(33): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:54(30): warning: `i' used uninitialized
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 24, inst: 158, bytes: 1688
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 23, inst: 110, bytes: 1176
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 20, inst: 59, bytes: 632
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 6, inst: 14, bytes: 152

I don't know how the GLSL spec handles this kind of undefined behavior, and I agree with you that the implementation should not try to handhold shader authors. What we need is an ecosystem that detects these issues earlier (e.g. glslangValidator warnings) and responds robustly by detecting when the draw call/batch/job is idle spinning/stuck by having some kind of reasonable timeout and killing it with a descriptive message when things go out of hand, instead of making the entire system unusable. And this is where I'm probably talking about stuff I don't know about, but this could probably be enforced on the DRI/DRM side of things instead of being driver-specific.

At the very least I *think* DXGI does something similar, but don't quote me on that.


PS: As an early mitigation, maybe the special case of self-incrementing/decrementing counter variables could be zero-initialized as a fallback. You could think of this test case as fuzzing/DoS prevention. I bet it wouldn't bloat the production shaders that you use for regression testing much.

Someone should probably try this in a robust WebGL context and see what happens.
Comment 13 Swyter 2018-03-27 12:22:55 UTC
I have opened a glslang issue here; I thought that cross-referencing both reports might be useful, for completeness:

https://github.com/KhronosGroup/glslang/issues/1315
Comment 14 Ilia Mirkin 2018-03-28 00:47:24 UTC
(In reply to Swyter from comment #12)
> detecting when the draw call/batch/job is idle spinning/stuck by having some
> kind of reasonable timeout and killing it with a descriptive message when
> things go out of hand, instead of making the entire system unusable. And
> this is where I'm probably talking about stuff I don't know about, but this
> could probably be enforced on the DRI/DRM side of things instead of being
> driver-specific.

Each driver is a separate implementation of DRM. Each driver has a separate compiler. An infinite loop can happen without any warnings whatsoever (i.e. done on purpose), and drivers ought to handle it gracefully.

But they don't. Killing jobs isn't always straightforward, and GPU resets are a plain disaster. This has to be done very much in concert with the hardware, so each driver is on its own. I don't think any of them handle this well, but nouveau is probably particularly bad.
Comment 15 Roland Scheidegger 2018-03-28 01:27:47 UTC
(In reply to Ilia Mirkin from comment #14)
> (In reply to Swyter from comment #12)
> > detecting when the draw call/batch/job is idle spinning/stuck by having some
> > kind of reasonable timeout and killing it with a descriptive message when
> > things go out of hand, instead of making the entire system unusable. And
> > this is where I'm probably talking about stuff I don't know about, but this
> > could probably be enforced on the DRI/DRM side of things instead of being
> > driver-specific.
> 
> Each driver is a separate implementation of DRM. Each driver has a separate
> compiler. An infinite loop can happen without any warnings whatsoever (i.e.
> done on purpose), and drivers ought to handle it gracefully.
> 
> But they don't. Killing jobs isn't always straightforward, GPU resets are a
> plain disaster. This has to be done very much in concert with the hardware,
> so each driver is on its own. I don't think any handle this well, but
> nouveau is probably particularly bad.

I think it works better on Windows because you can just reset the GPU, reload the driver, and restart DWM (basically start from scratch), although you still need to identify the right job to kill.
(FWIW, mileage on Macs will vary too, depending on the IHV or maybe even the chip; it is definitely not flawless there either. I think on chips not driving displays it generally works much better. Just confirming this isn't an easy problem.)
Comment 16 i.kalvachev 2018-03-28 17:55:43 UTC
(In reply to Ilia Mirkin from comment #11)
> As an aside... there's no compilation bug here. Perhaps $other driver
> happens to get you a value of 0, but nothing guarantees that. Could just be
> luck in precisely how they do RA. Certainly nothing in the spec.
> 
> And there are legitimate situations where a variable might be uninitialized
> prior to (compiler-proven) use, e.g.
> 
> int x;
> int y = 0;
> loop {
>   if (a)
>     y += x;
>   if (b)
>     x = 5;
> }
> 
> The compiler couldn't reasonably prove that a never happens before b does
> (except in some circumstances).
> 
> Would it be the end of the world if one were to add code to zero-initialize
> all variables? No - but it'd add unnecessary code to otherwise functioning
> shaders.
> 
> Mesa tends to stick to what's required by the spec.


The spec might not define what you should do in this case, but that means we are free to handle the case as we see fit, especially if we do something reasonable and consistent.

If the compiler cannot reasonably prove that the variable will be initialized and that the loop will finish, then maybe that is because it won't.

It is possible that the shader would work properly only for a limited range of inputs. If these inputs are the result of previous rendering, that could cause extremely hard-to-trigger and hard-to-reproduce bugs/hangs.

With SSA and phi nodes it is very easy to find when a variable is used uninitialized and handle the case in a deterministic way.
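
As a toy illustration of what I mean (plain C, nothing to do with Mesa's actual IR data structures or API, names made up): once the code is in SSA form, an uninitialized use shows up as an explicit "undef" value feeding a phi node, so flagging it is a simple scan over the values:

#include <stdio.h>

/* Toy SSA value: an explicit undef, a defined value, or a phi merging two
 * incoming values (e.g. loop-entry edge vs. back-edge). */
enum kind { UNDEF, DEF, PHI };

struct value {
    enum kind kind;
    const char *name;
    const struct value *src[2];   /* incoming values when kind == PHI */
};

static void warn_uninitialized(const struct value **vals, int n)
{
    for (int i = 0; i < n; i++) {
        if (vals[i]->kind != PHI)
            continue;
        for (int s = 0; s < 2; s++)
            if (vals[i]->src[s] && vals[i]->src[s]->kind == UNDEF)
                printf("warning: `%s' may be used uninitialized\n",
                       vals[i]->name);
    }
}

int main(void)
{
    /* SSA form of the earlier x/y example: on the loop back-edge x is 5,
     * but on the entry edge it is still undef, so x's phi has an undef source. */
    struct value x0 = { UNDEF, "x", { 0, 0 } };   /* int x;  (no initializer) */
    struct value x1 = { DEF,   "x", { 0, 0 } };   /* x = 5;                   */
    struct value x2 = { PHI,   "x", { &x0, &x1 } };
    const struct value *vals[] = { &x0, &x1, &x2 };
    warn_uninitialized(vals, 3);
    return 0;
}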
Comment 17 Ilia Mirkin 2018-03-28 18:38:49 UTC
(In reply to iive from comment #16)
> With SSA and phi nodes it is very easy to find when a variable is used
> uninitialized and handle the case in a deterministic way.

So initialize all those to INT_MIN? That'll work nicely.

That's the thing - there is no single "correct" thing to do in such situations.

for (float f; f > 0; f -= 1.0) { ... }

You really can't predict it. There's no real point in trying.
Comment 18 Swyter 2018-03-28 20:04:32 UTC
I bet that integers are the most common type by a wide margin. I also bet that most of these loops are meant to be unrolled and vectorized.

Covering 100% of the cases is almost impossible, but initializing uninitialized variables used inside the condition block to some known/sane value like NULL/0.f is probably a 90% solution, because whatever garbage value they would otherwise get is going to be wrong and may cause an infinite loop anyway. You can only win.
Comment 19 Sergii Romantsov 2018-03-29 13:19:57 UTC
Created attachment 138420 [details]
sorry, wrong attachment
Comment 20 Sergii Romantsov 2018-03-29 13:20:19 UTC
Created attachment 138422 [details]
sorry, wrong attachment
Comment 21 GitLab Migration User 2019-09-18 20:27:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1024.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.