Summary: | Mesa freezes when the GLSL shader contains a `for` loop with an uninitialized `i` index/counter variable | ||
---|---|---|---|
Product: | Mesa | Reporter: | Swyter <swyterzone+mesa3d> |
Component: | Mesa core | Assignee: | mesa-dev |
Status: | RESOLVED MOVED | QA Contact: | mesa-dev |
Severity: | normal | ||
Priority: | medium | CC: | swyterzone+mesa3d |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
default.frag
default.vert default-fixed.frag glxinfo.txt game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt sorry, wrong attachment sorry, wrong attachment |
Description
Swyter
2018-03-26 23:17:22 UTC
Created attachment 138365 [details]
default.vert
Created attachment 138366 [details]
default-fixed.frag
Created attachment 138367 [details]
glxinfo.txt
The shader compiler provides warnings: $ bin/glslparsertest foo.frag pass 1.30 Successfully compiled fragment shader foo.frag: 0:54(14): warning: `i' used uninitialized 0:58(39): warning: `i' used uninitialized 0:59(53): warning: `i' used uninitialized 0:61(30): warning: `i' used uninitialized 0:62(33): warning: `i' used uninitialized 0:54(30): warning: `i' used uninitialized PIGLIT: {"result": "pass" } These warnings would be available to you programmatically using the KHR_debug callback, I believe. Or printed to stderr with a debug mesa build. Created attachment 138368 [details]
game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt
@imirkin Looks like these GLSL warnings aren't exposed through GL_ARB_debug_output/GL_KHR_debug callbacks. I originally developed the Linux port using Mesa/radeonsi on a [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d4) (PCI ID 1002:130f) and the warnings and error reporting always proved useful, but that was before I implemented the dynamic lights part. I need to test if the newer version of the game (with the attached shader) lock ups AMD/radeon cards, too.
Anyway, thanks for the piglit tip. I'll use it from now on to lint my shaders, hopefully they'll end up slightly less 'freezy'.
Swyter: Can you run your program in gdb, interrupt it during the hang, and provide a backtrace? Reading Ilia's response, it seems that the trivial glslsparsertest based test case doesn't reproduce the hang. I guess a shader-runner test that tries to draw with the shader would... (In reply to Ian Romanick from comment #6) > Swyter: Can you run your program in gdb, interrupt it during the hang, and > provide a backtrace? > > Reading Ilia's response, it seems that the trivial glslsparsertest based > test case doesn't reproduce the hang. I guess a shader-runner test that > tries to draw with the shader would... glslparsertest just parses. The hang is just due to a very long shader loop (sometimes, I'd guess, depending what 'i' happens to get initialized to... like MIN_INT, it'll be looping for a while). gdb would just show it hung in "waiting for draw to complete". (In reply to Swyter from comment #5) > Created attachment 138368 [details] > game-log-with-debug-ogl-context-and-callback-no-undefined-warns.txt > > @imirkin Looks like these GLSL warnings aren't exposed through > GL_ARB_debug_output/GL_KHR_debug callbacks. I originally developed the Linux That's very surprising. This stuff goes through _mesa_glsl_warning which in turn should emit a DEBUG_TYPE_OTHER message. Are you sure that you don't get a callback with that? (Or perhaps you have it just filtering to ERROR's?) @imirkin I'm just doing this, nothing too fancy: > glEnable(GL_DEBUG_OUTPUT); > glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); GL_ASSERT; > glDebugMessageCallback(GLErrorCallback, NULL); GL_ASSERT; > glDebugMessageControl( > GL_DONT_CARE, > GL_DONT_CARE, > GL_DONT_CARE, > 0, > NULL, > TRUE > ); GL_ASSERT; If you take a look at the game log above you can see that it spits the compiler info stats. Maybe I'm doing something wrong. Anyway, here is a backtrace from a customer using i965 with an Intel(R) HD Graphics 520. He is running a release build of Mesa, so it doesn't come with debug symbols, but you can see that it is stuck at a DRI syscall at the end of the frame, just like you said: https://www.reddit.com/r/linux_gaming/comments/81t838/sphinx_and_the_cursed_mummy_freezes_at_intro/dvjsvop/ The only thing that appears in the stdout/err log from the driver side is this: > i965: Failed to submit batchbuffer: Input/output error (https://www.reddit.com/r/linux_gaming/comments/81t838/sphinx_and_the_cursed_mummy_freezes_at_intro/dvjhza6/) Some of the other replies in that reddit thread may be of interest. I told him to use the LIBGL_DEBUG=verbose env var, but it doesn't look too chatty. (In reply to Swyter from comment #9) > @imirkin I'm just doing this, nothing too fancy: > > > glEnable(GL_DEBUG_OUTPUT); > > glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); GL_ASSERT; > > glDebugMessageCallback(GLErrorCallback, NULL); GL_ASSERT; > > glDebugMessageControl( > > GL_DONT_CARE, > > GL_DONT_CARE, > > GL_DONT_CARE, > > 0, > > NULL, > > TRUE > > ); GL_ASSERT; > > If you take a look at the game log above you can see that it spits the > compiler info stats. Maybe I'm doing something wrong. A couple of stupid questions: 1. Could you be feeding the shader to glCompileShader before the above lines? 2. Could the end user have already cached the compilation of this shader? For the latter, one can run with MESA_GLSL_CACHE_DISABLE=1 to avoid using the cache. As an aside... there's no compilation bug here. Perhaps $other driver happens to get you a value of 0, but nothing guarantees that. Could just be luck in precisely how they do RA. Certainly nothing in the spec. And there are legitimate situations where a variable might be uninitialized prior to (compiler-proven) use, e.g. int x; int y = 0; loop { if (a) y += x; if (b) x = 5; } The compiler couldn't reasonably prove that a never happens before b does (except in some circumstances). Would it be the end of the world if one were to add code to zero-initialize all variables? No - but it'd add unnecessary code to otherwise functioning shaders. Mesa tends to stick to what's required by the spec. (You could argue that the "shader has gone into infinite loop" case should be handled better, with some success, but that's obviously a very driver-backend-specific issue.) Yeah. That was it, I only use a few shaders and they are all compiled at startup. Good to know that caching is implemented Mesa-wide now.
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 28, inst: 134, bytes: 1432
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 7, inst: 41, bytes: 440
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 16, inst: 17, bytes: 184
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 6, inst: 14, bytes: 152
> *EE: [! OGL LOG ] 0:54(14): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:58(39): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:59(53): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:61(30): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:62(33): warning: `i' used uninitialized
> *EE: [! OGL LOG ] 0:54(30): warning: `i' used uninitialized
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 24, inst: 158, bytes: 1688
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 23, inst: 110, bytes: 1176
> *EE: [! OGL LOG ] type: 0, local: 0, gpr: 20, inst: 59, bytes: 632
> *EE: [! OGL LOG ] type: 1, local: 0, gpr: 6, inst: 14, bytes: 152
I don't know about how the GLSL spec handles this kind of undefined behavior, and I agree with you that the implementation should not try to handhold shader authors. What we need is an ecosystem that detects these issues earlier (i.e. glslangValidator warnings) and responds robustly by detecting when the draw call/batch/job is idle spinning/stuck by having some kind of reasonable timeout and killing it with a descriptive message when things go out of hand, instead of making the entire system unusable. And this is where I'm probably talking about stuff I don't know about, but this could be probably enforced in the DRI/DRM side of things instead of being driver-specific.
At the very least I *think* DXGI does something similar, but don't quote me on that.
PS: As an early mitigation maybe the special case of self-incrementing/decrementing counter variables could be zero-initialized as fallback. You could think of this test case as fuzzing, DoS prevention. I bet it doesn't bloat the production shaders that you guys use for regression testing much.
Someone probably has to try this in a robust WebGL context and see what happens.
I have opened a glslang issue here, I thought that cross-referencing both reports might be interesting, for completeness: https://github.com/KhronosGroup/glslang/issues/1315 (In reply to Swyter from comment #12) > detecting when the draw call/batch/job is idle spinning/stuck by having some > kind of reasonable timeout and killing it with a descriptive message when > things go out of hand, instead of making the entire system unusable. And > this is where I'm probably talking about stuff I don't know about, but this > could be probably enforced in the DRI/DRM side of things instead of being > driver-specific. Each driver is a separate implementation of DRM. Each driver has a separate compiler. An infinite loop can happen without any warnings whatsoever (i.e. done on purpose), and drivers ought to handle it gracefully. But they don't. Killing jobs isn't always straightforward, GPU resets are a plain disaster. This has to be done very much in concert with the hardware, so each driver is on its own. I don't think any handle this well, but nouveau is probably particularly bad. (In reply to Ilia Mirkin from comment #14) > (In reply to Swyter from comment #12) > > detecting when the draw call/batch/job is idle spinning/stuck by having some > > kind of reasonable timeout and killing it with a descriptive message when > > things go out of hand, instead of making the entire system unusable. And > > this is where I'm probably talking about stuff I don't know about, but this > > could be probably enforced in the DRI/DRM side of things instead of being > > driver-specific. > > Each driver is a separate implementation of DRM. Each driver has a separate > compiler. An infinite loop can happen without any warnings whatsoever (i.e. > done on purpose), and drivers ought to handle it gracefully. > > But they don't. Killing jobs isn't always straightforward, GPU resets are a > plain disaster. This has to be done very much in concert with the hardware, > so each driver is on its own. I don't think any handle this well, but > nouveau is probably particularly bad. I think it works better in windows because you can just reset gpu, reload the driver, restart dwm (basically start from scratch). Albeit still need to identify the right job to kill. (FWIW the mileage on macs will vary too, depending on IHV or maybe even chip, it is definitely not flawless there neither - I think on chips not driving displays it generally works much better. Just confirming this isn't an easy problem.) (In reply to Ilia Mirkin from comment #11) > As an aside... there's no compilation bug here. Perhaps $other driver > happens to get you a value of 0, but nothing guarantees that. Could just be > luck in precisely how they do RA. Certainly nothing in the spec. > > And there are legitimate situations where a variable might be uninitialized > prior to (compiler-proven) use, e.g. > > int x; > int y = 0; > loop { > if (a) > y += x; > if (b) > x = 5; > } > > The compiler couldn't reasonably prove that a never happens before b does > (except in some circumstances). > > Would it be the end of the world if one were to add code to zero-initialize > all variables? No - but it'd add unnecessary code to otherwise functioning > shaders. > > Mesa tends to stick to what's required by the spec. The spec might not define what you should do in this case, but this means that we are free to handle the case as we see fit. Especially if we do something reasonable and consistent. If the compiler cannot reasonably prove that the variable will be initialized and the loop will finish, then maybe that is because it wont. It is possible that the shader would work properly only for a limited range of inputs. If these inputs are result of previous rendering, it might cause an extremely hard to trigger and reproduce bugs/hangs. With SSA and phi it is very easy to find when variable is used uninitialized and handle the case in deterministic way. (In reply to iive from comment #16) > With SSA and phi it is very easy to find when variable is used uninitialized > and handle the case in deterministic way. So initialize all those to MIN_INT? That'll work nicely. That's the thing - there is no single "correct" thing to do in such situations. for (float f; f > 0; f -= 1.0) { ... } You really can't predict it. There's no real point in trying. I bet that integers are the most common type by a wide margin. I also bet that most of these loops are meant to be unrolled and vectorized. Covering 100% of the cases is almost impossible, but initializing uninitialized variables inside of the condition block to some known/sane value like NULL/0.f is probably a 90% solution, because the value is going to be wrong and cause an infinite loop anyway. You can only win. Created attachment 138420 [details]
sorry, wrong attachment
Created attachment 138422 [details]
sorry, wrong attachment
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1024. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.