Created attachment 139956 [details]
Test case that exhibits the issue

If a value loaded from an SSBO is used multiple times, the load_ssbo gets
copy propagated to each use. SSBO loads are expensive, so, while this may be
technically correct, it is also bad. Arguments can be made that the test
case is not correct, but that is an orthogonal issue.

The core of the attached test case is:

    layout(binding = 0) buffer bufblock {
        int value;
    };

    ...

    int f;

    /* This is an open-coded atomicAdd. */
    do {
        f = value;
    } while (f != atomicCompSwap(value, f, f + 4));

For this code, the resulting NIR is (note the four load_ssbo operations):

    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_4 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_5 = iadd ssa_0, ssa_4
        vec1 32 ssa_6 = intrinsic load_ssbo (ssa_5, ssa_0) () ()
        vec1 32 ssa_7 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_8 = iadd ssa_0, ssa_7
        vec1 32 ssa_9 = intrinsic load_ssbo (ssa_8, ssa_0) () ()
        vec1 32 ssa_10 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_11 = iadd ssa_0, ssa_10
        vec1 32 ssa_12 = intrinsic load_ssbo (ssa_11, ssa_0) () ()
        vec1 32 ssa_13 = iadd ssa_12, ssa_1
        vec1 32 ssa_14 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_15 = iadd ssa_0, ssa_14
        vec1 32 ssa_16 = intrinsic ssbo_atomic_comp_swap (ssa_15, ssa_0, ssa_9, ssa_13) () ()
        vec1 32 ssa_17 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_18 = iadd ssa_0, ssa_17
        vec1 32 ssa_19 = intrinsic load_ssbo (ssa_18, ssa_0) () ()
        vec1 32 ssa_20 = ieq ssa_19, ssa_16
        /* succs: block_2 block_3 */
        if ssa_20 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }

The multiple reads get a mishmash of values (written by other threads), so
the loop may never terminate. Annotating 'value' with volatile helps, but
the generated code is still wrong. There should definitely *NOT* be a
load_ssbo after the ssbo_atomic_comp_swap!

    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_4 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_5 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_6 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_7 = iadd ssa_6, ssa_1
        vec1 32 ssa_8 = intrinsic ssbo_atomic_comp_swap (ssa_2, ssa_0, ssa_5, ssa_7) () ()
        vec1 32 ssa_9 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_10 = ieq ssa_9, ssa_8
        /* succs: block_2 block_3 */
        if ssa_10 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }

Completely rewriting the loop as

    int f = value;
    int a;

    do {
        a = f;
    } while ((f = atomicCompSwap(value, f, f + 4)) != a);

produces the expected result, a single load_ssbo hoisted before the loop:

    vec1 32 ssa_4 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
    /* succs: block_1 */
    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_5 = phi block_0: ssa_4, block_4: ssa_7
        vec1 32 ssa_6 = iadd ssa_5, ssa_1
        vec1 32 ssa_7 = intrinsic ssbo_atomic_comp_swap (ssa_2, ssa_0, ssa_5, ssa_6) () ()
        vec1 32 ssa_8 = ieq ssa_7, ssa_5
        /* succs: block_2 block_3 */
        if ssa_8 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }
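For reference, ignoring its return value, the open-coded loop is
semantically just the built-in atomic (a minimal sketch; presumably the test
open-codes it deliberately to exercise the compare-swap path):

    /* GLSL built-in with the same net effect as the loop above. */
    atomicAdd(value, 4);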
Upon further inspection, it turns out this happens before NIR even sees it.
Are you sure that's illegal? I'd imagine that you'd need that variable to be marked as coherent, or to introduce a compiler barrier somehow. One could also argue that an atomic should introduce such a barrier explicitly -- I think in C they tend to act as compiler barriers.
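For illustration, marking the member coherent would look like this (a sketch
using the standard GLSL memory qualifier, with the binding layout copied
from the test case above; the attached test does not currently do this):

    layout(binding = 0) buffer bufblock {
        coherent int value;  /* reads/writes become visible across invocations */
    };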
(In reply to Ilia Mirkin from comment #2)
> Are you sure that's illegal? I'd imagine that you'd need that variable to be
> marked as coherent, or to introduce a compiler barrier somehow.

Legality isn't really relevant. The shader source contains one instance of
a very, very expensive operation, and we decide to do it four times. That's
unacceptable.
(In reply to Ian Romanick from comment #3)
> (In reply to Ilia Mirkin from comment #2)
> > Are you sure that's illegal? I'd imagine that you'd need that variable to be
> > marked as coherent, or to introduce a compiler barrier somehow.
>
> Legality isn't really relevant. The shader source contains one instance of
> a very, very expensive operation, and we decide to do it four times. That's
> unacceptable.

Oh wait. I totally misread. Copy propagate. And it does so ACROSS an atomic,
no less! That's definitely broken.
I'd be a fan of just making GLSL IR copy propagation bail on anything that
isn't a local, global, or uniform. We could try to make it smart enough to
handle barriers correctly with SSBOs and shared variables, but that's very
hard to get right.
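In GLSL terms, the storage classes in question would look roughly like this
(an illustrative sketch of the proposed distinction; the actual check would
be on the variable's storage mode inside the GLSL IR pass):

    uniform int u;           /* uniform: immutable for the draw, safe to propagate */
    int g;                   /* global temporary: private to the invocation, safe */
    buffer blk { int s; };   /* SSBO: other invocations can write it, must bail */
    shared int sh;           /* compute-only shared: same problem, must bail */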
Also, please turn your test case there into a piglit test. Extra points if it's a CTS test so we force everyone else to be correct too. :-)
(In reply to Jason Ekstrand from comment #6)
> Also, please turn your test case there into a piglit test. Extra points if
> it's a CTS test so we force everyone else to be correct too. :-)

I'm not sure if you looked at it, but it is a piglit shader_runner test.
The current failure mode is many (recoverable) GPU hangs.
(In reply to Ian Romanick from comment #7)
> (In reply to Jason Ekstrand from comment #6)
> > Also, please turn your test case there into a piglit test. Extra points if
> > it's a CTS test so we force everyone else to be correct too. :-)
>
> I'm not sure if you looked at it, but it is a piglit shader_runner test.
> The current failure mode is many (recoverable) GPU hangs.

Which I should point out are caused by infinite loops in the shader... not
some other problem. :)
(In reply to Ian Romanick from comment #8)
> (In reply to Ian Romanick from comment #7)
> > (In reply to Jason Ekstrand from comment #6)
> > > Also, please turn your test case there into a piglit test. Extra points if
> > > it's a CTS test so we force everyone else to be correct too. :-)
> >
> > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > The current failure mode is many (recoverable) GPU hangs.

I figured. What other kind of simple shader test is there? That was more
of a "Once you fix it, please commit to piglit".

> Which I should point out are caused by infinite loops in the shader... not
> some other problem. :)

Those are a very good source of hangs. They're actually debuggable. :)
(In reply to Jason Ekstrand from comment #9)
> > > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > > The current failure mode is many (recoverable) GPU hangs.
>
> I figured. What other kind of simple shader test is there? That was more
> of a "Once you fix it, please commit to piglit".

Please don't add piglit tests which are expected to hang the GPU, or at
least don't set them up to be run by default. Not all drivers can reliably
recover from GPU hangs.
Even with the fixes to the GLSL copy propagation (and other passes) logic,
the shader was still hanging. After a debugging session with Jason, he
figured out that the helper invocations were not making progress and were
stuck in an infinite loop. While the reads of 'value' return correct data,
the result of atomicCompSwap is undefined for helper invocations, so the
comparison always failed.

The alternative version of the test in the report (the one that works)
avoids this issue, since the return value of atomicCompSwap seems to always
be the same (although not something we should rely on, I guess), allowing
the loop to end.

Returning early based on gl_HelperInvocation makes things work with the
original loop, but depends on a newer GL version.
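For the record, the early-return workaround looks roughly like this (a
sketch only; gl_HelperInvocation requires GLSL 4.50, which is the version
dependency mentioned above):

    void main()
    {
        /* Atomics return undefined values in helper invocations, so the
         * compare-and-swap loop below might never terminate for them. */
        if (gl_HelperInvocation)
            return;

        int f;
        do {
            f = value;
        } while (f != atomicCompSwap(value, f, f + 4));

        ...
    }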
(In reply to Michel Dänzer from comment #10)
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

IMHO those drivers should be fixed.
(In reply to almos from comment #12)
> (In reply to Michel Dänzer from comment #10)
> > Please don't add piglit tests which are expected to hang the GPU, or at
> > least don't set them up to be run by default. Not all drivers can reliably
> > recover from GPU hangs.
>
> IMHO those drivers should be fixed.

Hang recovery is kernel functionality, not Mesa's. There can be Mesa drivers
for HW with experimental kernel drivers, which don't have reliable hang
recovery yet (and the HW could be buggy too).

It would be interesting to know which of the current (kernel) drivers have
known non-recoverable GPU hangs, though.
(In reply to Eero Tamminen from comment #13)
> Hang recovery is kernel functionality, not Mesa's. There can be Mesa drivers
> for HW with experimental kernel drivers, which don't have reliable hang
> recovery yet (and the HW could be buggy too).
>
> It would be interesting to know which of the current (kernel) drivers have
> known non-recoverable GPU hangs, though.

A few examples: radeon, amdgpu, nouveau.
(In reply to Michel Dänzer from comment #10)
> (In reply to Jason Ekstrand from comment #9)
> > > > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > > > The current failure mode is many (recoverable) GPU hangs.
> >
> > I figured. What other kind of simple shader test is there? That was more
> > of a "Once you fix it, please commit to piglit".
>
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

The intention is not to add a test that hangs. The two required fixes are
to the test and to GLSL IR. Once those are made, it shouldn't hang for
anyone.

As far as general policy goes, no, we don't want to have GPU hangs. Maybe
we could even go so far as to say that if you have a test which has a high
probability of hanging the GPU, it shouldn't be added to the test list by
default. However, there are many kinds of tests which might hang the GPU if
the driver has a bug, through no fault of the test. As someone who works on
a very hangy GPU, I'm keenly aware of this. :-) That doesn't mean we
shouldn't land those tests; it means we should fix those drivers.
(In reply to Caio Marcelo de Oliveira Filho from comment #11)
> Even with the fixes to the GLSL copy propagation (and other passes) logic,
> the shader was still hanging. After a debugging session with Jason, he
> figured out that the helper invocations were not making progress and were
> stuck in an infinite loop.

As requested, I created https://bugs.freedesktop.org/show_bug.cgi?id=106902
for this particular issue with helper invocations.
I have sent fixes to the mesa-dev list:

https://patchwork.freedesktop.org/series/44661/

And an updated version of the test case (that avoids the helper pixels
problem) to the piglit list:

https://patchwork.freedesktop.org/patch/229132/
This bug should be fixed by the commits listed below.

(In reply to Michel Dänzer from comment #10)
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

The updated test does not cause GPU hangs with these patches applied. Are
you ok with the test landing in piglit master?

commit 37bd9ccd21b860d2b5ffea7e1f472ec83b68b43b (HEAD -> bug-106774, origin/master, origin/HEAD)
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue Jun 5 16:02:25 2018 -0700

    glsl: Don't copy propagate elements from SSBO or shared variables either

    Since SSBOs can be written by a different GPU thread, copy propagating
    a read can cause the value to magically change. SSBO reads are also
    very expensive, so doing it twice will be slower.

    The same shader was helped by this patch and the previous.

    Haswell, Broadwell, and Skylake had similar results. (Skylake shown)
    total instructions in shared programs: 14399119 -> 14399113 (<.01%)
    instructions in affected programs: 683 -> 677 (-0.88%)
    helped: 1
    HURT: 0

    total cycles in shared programs: 532973113 -> 532971865 (<.01%)
    cycles in affected programs: 524666 -> 523418 (-0.24%)
    helped: 1
    HURT: 0

    Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
    Cc: mesa-stable@lists.freedesktop.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106774

commit 461a5c899c08064467abb635536381a5a5659280
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue Jun 5 15:04:24 2018 -0700

    glsl: Don't copy propagate from SSBO or shared variables either

    Since SSBOs can be written by other GPU threads, copy propagating a
    read can cause the value to magically change. SSBO reads are also very
    expensive, so doing it twice will be slower.

    Haswell, Broadwell, and Skylake had similar results. (Skylake shown)
    total instructions in shared programs: 14399120 -> 14399119 (<.01%)
    instructions in affected programs: 684 -> 683 (-0.15%)
    helped: 1
    HURT: 0

    total cycles in shared programs: 532978931 -> 532973113 (<.01%)
    cycles in affected programs: 530484 -> 524666 (-1.10%)
    helped: 1
    HURT: 0

    Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
    Cc: mesa-stable@lists.freedesktop.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106774
(In reply to Ian Romanick from comment #18)
> The updated test does not cause GPU hangs with these patches applied. Are
> you ok with the test landing in piglit master?

Sure, thanks.
FWIW, this sounds a lot like the problem we tried to address in NIR with a load-combine pass some time ago here: https://lists.freedesktop.org/archives/mesa-dev/2016-April/112925.html
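In source terms, the effect such a pass aims for is something like this
(a GLSL-level illustration of the intent only; the pass itself would
operate on NIR load/store intrinsics):

    /* Before: each use of `value` turns into its own load_ssbo. */
    bool ok = (value > 0) && (value < 10);

    /* After combining (only legal when no store, atomic, or barrier that
     * could touch the buffer intervenes between the two loads): */
    int tmp = value;
    bool ok2 = (tmp > 0) && (tmp < 10);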