https://github.com/mc-imperial/shader-compiler-bugs/issues/75 Steps to Reproduce: ------------------------------- 0. This is reproducable with Mesa version 17.1.3, on Intel intergated GPUs. 1. Obtain and build the latest release of get-image-glsl, a simple tool that creates a .png image from a fragment shader. We need the `get_image_glfw` executable. https://github.com/graphicsfuzz/get-image-glsl 2. From a terminal, execute (this must be done in the directory where both .frag and .json files are): `/path/to/get_image_glfw original.frag` This will create output.png, which should look like original.png 3. From a terminal, execute (same remark as above): `/path/to/get_image_glfw variant.frag` This will create output.png, which should look like variant.png Expected Results: ------------------------------- Both shaders should render an image that looks like original.png. This is because the only difference between the .frag files is a 'do-while' loop that should execute only once: ``` $ diff -u original.frag variant.frag --- original.frag 2017-06-28 11:04:21.278754777 +0100 +++ variant.frag 2017-06-28 11:23:16.925591708 +0100 @@ -42,10 +42,12 @@ { s += _FLOAT_CONST[2]; } + do { if(float(i) >= limit) { return s; } + } while(gl_FragCoord.x < 0.0); } return s; } ``` Actual Results: ------------------------------- The shaders render completely different images, which they should not. Mesa code coverage: ------------------------------- Unzip the `coverage.html.zip` file to get a web page that present the comparative coverage of Mesa source code between the execution of the original and the variant shader. Additional Information: ------------------------------- Some info on the system where the bug was found: CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz GPU: "2nd Generation Core Processor Family Integrated Graphics Controller" OS: Linux 4.8.0-56-generic #61~16.04.1-Ubuntu SMP Mesa version: 17.1.3
Repro'd with: OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 530 (Skylake GT2) OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.1.3 as well as current HEAD (ccfac28835). Works fine with llvmpipe, softpipe, and swr from the same mesa versions, so the issue is most likely somewhere in i965.
[Just some musings, figured I'd take a closer look... feel free to ignore.] GLSL seems fine (note use of return_flag): (loop ( (if (expression bool >= (expression float i2f (var_ref i@2) ) (swiz x (var_ref gl_FragCoord) )) ( (assign (x) (var_ref return_value) (var_ref s) ) (assign (x) (var_ref return_flag) (constant bool (1)) ) break ) ()) (if (expression bool >= (swiz x (var_ref gl_FragCoord) )(constant float (0.000000)) ) ( break ) ()) )) (if (var_ref return_flag) ( break ) And the NIR seems OK: (return_flag) vec1 32 ssa_39 = phi block_9: ssa_7, block_12: ssa_19 /* succs: block_16 block_17 */ if ssa_39 { block block_16: /* preds: block_15 */ break And ssa_7 is 0xffffffff as expected (from block_9 which is the one where return_flag is set to 1) ssa_19 is defined in the outer loop: vec1 32 ssa_19 = phi block_0: ssa_1, block_18: ssa_39 Where ssa_1 == 0, and ssa_39 is defined above. And the register-ized NIR seems fine too at first glance, at least as far as this return business is concerned. However looking at the SIMD8 shader, the control flow seems off. This is where I get a lot weaker, but I believe that I've found the register for the return_flag, but I don't see any control flow that depends on it. There's also only 1 while loop, rather than nested (it seems).
I guess this makes sense, but commenting out the opt_predicated_break pass in brw_fs.cpp:fs_visitor::optimize fixes it (at least this specific example).
opt_copy_propagate() is at fault. The additional control flow and the predicated break pass somehow leave the IR in a state opt_copy_propagate() mishandles.
Could the comparative code coverage between the different runs help to locate the bug root cause? The report covergage.html is available in the bug file archive, it is also temporary directly accessible at this URL -- yet it is a 35MB web page, so it may freeze your browser for a some seconds before being displayed: https://www.doc.ic.ac.uk/~hevrard/hidden_glfuzz/coverage_mesa-17-1-3_18May_130_noliterals_colorgrid_modulo_inv_variant_2/index.html This code coverage compares how many times each line of Mesa source code is executed when rendering each shader. It highlights when line execution counters are different (yellow), and especially when some lines are executed only during the rendering of one of the two shaders (red). Only files with different counters are reported, and they are ranked to display the source files where there is the biggest difference first. The line counters indicates first the number of hits when rendering the *original* shader, then the ones for the *variant* shader. We can see for instance that the top three source files where line execution differs are: src/intel/compiler/brw_cfg.cpp src/intel/brw_predicated_break.cpp src/compiler/nir_loop_analyze.c My lack of knowledge of Mesa internals makes it harder for me to feel what's going on, but for you Mesa developers, the red lines in these files may hint at the bug root cause?
Retested on master: commit 109de3049dda6be2a5a3910f777feea0bbf9ce92. Issue persists with SKL.
Created attachment 135074 [details] attachment-21457-0.html I am attending SPLASH this week in Vancouver, and will only have occasional access to email, so my responses may be slower than usual. Best wishes Ally
I am able to reproduce this issue on Skylake GT2. But it looks like it is already fixed. Tested the fallowing mesa versions: Mesa 17.3.6 (git-b3e5a3f35b) - FAIL Mesa 17.3.7 (git-8a51f3857c) - FAIL mesa-18.0.0-rc1 (e91e68d6a87a14) - OK Mesa 18.0.0 (git-dceb1ce807) - OK Mesa 18.1.0-devel (git-e7fc18097e) - OK Bisected and found the fix (tested it by reverting from the latest mesa master): commit 4d1959e69328cf0d59f0ec7aeea5a2b704ef0c5f Author: Francisco Jerez <currojerez@riseup.net> Date: Fri Oct 13 17:52:00 2017 -0700 intel/cfg: Represent divergent control flow paths caused by non-uniform loop execution. This addresses a long-standing back-end compiler bug that could lead to cross-channel data corruption in loops executed non-uniformly. In some cases live variables extending through a loop divergence point (e.g. a non-uniform break) into a convergence point (e.g. the end of the loop) wouldn't be considered live along all physical control flow paths the SIMD thread could possibly have taken in between due to some channels remaining in the loop for additional iterations. This patch fixes the problem by extending the CFG with physical edges that don't exist in the idealized non-vectorized program, but represent valid control flow paths the SIMD EU may take due to the divergence of logical threads. This makes sense because the i965 IR is explicitly SIMD, and it's not uncommon for instructions to have an influence on neighboring channels (e.g. a force_writemask_all header setup), so the behavior of the SIMD thread as a whole needs to be considered. No changes in shader-db. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
I've re-tested this on SKL-GT2 and can confirm that this specific repro no longer triggers this bug. The commit that was reverse-bisected to certainly seems to make sense given what I saw when I was originally looking as well as what Matt said. Marking this as FIXED. Thanks Curro!
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.