Summary: | [GLSL] lower_jumps with continue-statements in for-loops prevents loop unrolling | ||
---|---|---|---|
Product: | Mesa | Reporter: | Shuang He <shuang.he> |
Component: | glsl-compiler | Assignee: | mesa-dev |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | low | CC: | christophe.prigent, danylo.piliaiev, idr, kenneth, stereotype441 |
Version: | git | ||
Hardware: | Other | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
piglit test case
Removing unnecessary continue |
Description
Shuang He
2010-12-07 21:18:17 UTC
Created attachment 40894 [details]
piglit test case
This failure will happen for any driver that needs if-statements to be flattened. Reassigning to glsl-compiler. I committed glsl-{vs,fs}-loop-continue.shader_test to reproduce this failure. It appears that this failure still exists on Mesa master. It looks like the i915 driver isn't setting EmitNoCond or EmitNoBreak. Perhaps we want to do that? This is actually a really, really bad bug in lower_jumps. Early in compile time we convert for (int i=0; i<8; i++) { if (i > 4) { gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0); continue; } gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0); } into int i = 0; for (;;) { if (!(i < 8)) break; if (i > 4) { gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0); continue; } gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0); i++; } Then lower_jumps comes along and converts it into: int i = 0; for (;;) { bool continue_flag = false; if (!(i < 8)) break; if (i > 4) { gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0); continue_flag = true; } if (!continue_flag) { gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0); i++; } } If we execute the continue-statement, we don't execute the loop increment code. It's not quite that bad. It looks like we anticipated this sort of problem. When a continue statement is emitted the loop-increment code is emitted (again) in place. So we actually get int i = 0; for (;;) { if (!(i < 8)) break; if (i > 4) { gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0); i++; continue; } gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0); i++; } Then lower_jumps comes along and converts it into: int i = 0; for (;;) { bool continue_flag = false; if (!(i < 8)) break; if (i > 4) { gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0); i++; continue_flag = true; } if (!continue_flag) { gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0); i++; } } Since there are multiple (identical) assignments to i, the loop doesn't get unrolled. Paul, did you by chance fix this when you were working on lower_jumps a while ago? (In reply to comment #7) > Paul, did you by chance fix this when you were working on lower_jumps a while > ago? No, I don't believe I did. In fact, it seems to me that this bug has nothing to do with jump lowering. As Ian mentioned in comment #6, the reason the loop isn't getting unrolled is because there are multiple assignments to i, and that was true both before and after jump lowering. I suspect the only feasible ways to fix this bug are (a) make loop analysis much more sophisticated, and (b) make the IR richer so that it's not necessary to duplicate the loop increment. Unfortunately, both of those options would be fairly significant undertakings. The core issue here is that the ir_loop data structure is not sufficiently rich to keep track of which parts of a loop are part of the loop body and which parts are part of the loop increment. Which means when the code is converted from AST to HIR, this: for (initialization; condition; increment) { body; } is effectively converted to this: initialization; for (;;) { if (!condition) { break; } body; increment; } This creates a problem: if a continue statement were to appear inside the loop body, it would skip the increment, most likely resulting in an infinite loop. To avoid that, there is a hack in the ast-to-hir code for the continue statement which automatically re-emits the loop increment just before emitting the continue. This causes correct loop behavior, but as Ian previously mentioned in comment 6, defeats loop unrolling. I think the right solution to this bug is to expand the IR so that ir_loop maintains the loop body separately from the increment. That way we won't have to do the hack in the ast-to-hir for continue statements, and the loop will be unrollable. Unfortunately, this is a large undertaking, since it involves modifying all lowering passes and optimizations that act on loops, to make sure they handle the loop increment properly. It's far to risky to do before the Mesa 8.0 release. I'm removing this bug from the list of Mesa 8.0 release blockers. I'll make a note to myself to revisit this bug after the Mesa 8.0 release. I'm assigning this to myself and will be taking a look at it as part of my work on a loop unrolling pass for NIR. With NIR we end up with : loop { block block_1: /* preds: block_0 block_5 block_7 */ vec1 32 ssa_8 = phi block_0: ssa_3, block_5: ssa_4, block_7: ssa_19 vec1 32 ssa_9 = phi block_0: ssa_2, block_5: ssa_17, block_7: ssa_4 vec1 32 ssa_10 = phi block_0: ssa_1, block_5: ssa_4, block_7: ssa_4 vec1 32 ssa_11 = phi block_0: ssa_0, block_5: ssa_4, block_7: ssa_4 vec1 32 ssa_12 = phi block_0: ssa_4, block_5: ssa_16, block_7: ssa_18 vec4 32 ssa_13 = vec4 ssa_8, ssa_9, ssa_10, ssa_11 vec1 32 ssa_14 = ige ssa_12, ssa_5 /* succs: block_2 block_3 */ if ssa_14 { block block_2: /* preds: block_1 */ break /* succs: block_8 */ } else { block block_3: /* preds: block_1 */ /* succs: block_4 */ } block block_4: /* preds: block_3 */ vec1 32 ssa_15 = ilt ssa_6, ssa_12 /* succs: block_5 block_6 */ if ssa_15 { block block_5: /* preds: block_4 */ vec1 32 ssa_16 = iadd ssa_12, ssa_7 vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000 */) continue /* succs: block_1 */ } else { block block_6: /* preds: block_4 */ /* succs: block_7 */ } block block_7: /* preds: block_6 */ vec1 32 ssa_18 = iadd ssa_12, ssa_7 vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000 */) /* succs: block_1 */ } So all we need to do is move everything after the if into the else block and remove the continue. Removing myself as assignee, this would probably be a good beginners task. (In reply to Timothy Arceri from comment #11) > > So all we need to do is move everything after the if into the else block and > remove the continue. Removing myself as assignee, this would probably be a > good beginners task. Hi, I've tried to do this and it works for me however it alone doesn't solve the problem. Consider the resulting nir: loop { block block_1: /* preds: block_0 block_7 */ vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20 vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4 vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4 vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21 vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22 vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9 vec1 32 ssa_14 = ige ssa_8, ssa_5 /* succs: block_2 block_3 */ if ssa_14 { block block_2: /* preds: block_1 */ break /* succs: block_8 */ } else { block block_3: /* preds: block_1 */ /* succs: block_4 */ } block block_4: /* preds: block_3 */ vec1 32 ssa_15 = ilt ssa_6, ssa_8 /* succs: block_5 block_6 */ if ssa_15 { block block_5: /* preds: block_4 */ vec1 32 ssa_16 = iadd ssa_8, ssa_7 vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000 */) /* succs: block_7 */ } else { block block_6: /* preds: block_4 */ vec1 32 ssa_18 = iadd ssa_8, ssa_7 vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000 */) /* succs: block_7 */ } block block_7: /* preds: block_5 block_6 */ vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18 vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4 vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19 /* succs: block_1 */ } Now in both "if" (block_5) and "else" (block_6) blocks there are identical expressions and no continue. However there is no optimization to pull these identical expressions out - CSE pass won't do this since it's a local CSE, only global CSE would help. And there is no active global CSE pass, there is only Global Code Motion pass with Global Value Numbering and it is not enabled, enabling it optimizes the condition in question and in the end whole loop disappears as expected, however this pass doesn't look something we want since in other cases it hurts shaders and it is more than just global CSE. Any opinions on this? (In reply to Danylo from comment #12) > (In reply to Timothy Arceri from comment #11) > > > > So all we need to do is move everything after the if into the else block and > > remove the continue. Removing myself as assignee, this would probably be a > > good beginners task. > Hi, > > I've tried to do this and it works for me however it alone doesn't solve the > problem. > > Consider the resulting nir: > > loop { > block block_1: > /* preds: block_0 block_7 */ > vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20 > vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4 > vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4 > vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21 > vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22 > vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9 > vec1 32 ssa_14 = ige ssa_8, ssa_5 > /* succs: block_2 block_3 */ > if ssa_14 { > block block_2: > /* preds: block_1 */ > break > /* succs: block_8 */ > } else { > block block_3: > /* preds: block_1 */ > /* succs: block_4 */ > } > block block_4: > /* preds: block_3 */ > vec1 32 ssa_15 = ilt ssa_6, ssa_8 > /* succs: block_5 block_6 */ > if ssa_15 { > block block_5: > /* preds: block_4 */ > vec1 32 ssa_16 = iadd ssa_8, ssa_7 > vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000 > */) > /* succs: block_7 */ > } else { > block block_6: > /* preds: block_4 */ > vec1 32 ssa_18 = iadd ssa_8, ssa_7 > vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000 > */) > /* succs: block_7 */ > } > block block_7: > /* preds: block_5 block_6 */ > vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18 > vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4 > vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19 > /* succs: block_1 */ > } > > Now in both "if" (block_5) and "else" (block_6) blocks there are identical > expressions and no continue. > However there is no optimization to pull these identical expressions out - > CSE pass won't do this since it's a local CSE, only global CSE would help. > And there is no active global CSE pass, there is only Global Code Motion > pass with Global Value Numbering and it is not enabled, enabling it > optimizes the condition in question and in the end whole loop disappears as > expected, however this pass doesn't look something we want since in other > cases it hurts shaders and it is more than just global CSE. > > Any opinions on this? None of that should matter. If the continue if removed there should be nothing stopping the loop from unrolling, and if the loop is unrolled the both ifs should be able to be optimised away (assuming I'm reading the IR correctly). Is this not what you are seeing? (In reply to Timothy Arceri from comment #13) > > None of that should matter. If the continue if removed there should be > nothing stopping the loop from unrolling, and if the loop is unrolled the > both ifs should be able to be optimised away (assuming I'm reading the IR > correctly). Is this not what you are seeing? Unfortunately not, loop isn't unrolled. To be on the same page the optimization I did is turning loop { ... if (cond) { do_work_1(); continue; } else { } do_work_2(); } into: loop { ... if (cond) { do_work_1(); } else { do_work_2(); } } So in our case it effectively produces: ... if (cond) { i++; do_work_1(); } else { i++; do_work_2(); } ... Looks like in previous comment I forgot to say that both branches have 'i++'. Loop with such condition couldn't be unrolled because 'compute_induction_information' could not find induction variable because 'i' is in a control flow > /* If one of the sources is in a conditional or nested block then > * panic. > */ > if (src_var->in_control_flow) > break; To make loop unrollable 'i++' should be outside of conditional block and there is no optimization that could pull it out as I wrote in previous comment. Created attachment 142567 [details] [review] Removing unnecessary continue Optimization in question. (In reply to Danylo from comment #15) > Created attachment 142567 [details] [review] [review] > Removing unnecessary continue > > Optimization in question. Thanks! I've tried a few variations of this patch and eventually settled on a v2 which I sent out with an update to loop analysis so that we can detect the duplicate additions. https://patchwork.freedesktop.org/series/53128/ Should be fixed as of: commit 9e6b39e1d521aa723749a47d958d58900bf25138 (HEAD -> master, origin/master, origin/HEAD) Author: Timothy Arceri <tarceri@itsqueeze.com> Date: Mon Nov 26 12:05:00 2018 +1100 nir: detect more induction variables This allows loop analysis to detect inductions variables that are incremented in both branches of an if rather than in a main loop block. For example: loop { block block_1: /* preds: block_0 block_7 */ vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20 vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4 vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4 vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21 vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22 vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9 vec1 32 ssa_14 = ige ssa_8, ssa_5 /* succs: block_2 block_3 */ if ssa_14 { block block_2: /* preds: block_1 */ break /* succs: block_8 */ } else { block block_3: /* preds: block_1 */ /* succs: block_4 */ } block block_4: /* preds: block_3 */ vec1 32 ssa_15 = ilt ssa_6, ssa_8 /* succs: block_5 block_6 */ if ssa_15 { block block_5: /* preds: block_4 */ vec1 32 ssa_16 = iadd ssa_8, ssa_7 vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000*/) /* succs: block_7 */ } else { block block_6: /* preds: block_4 */ vec1 32 ssa_18 = iadd ssa_8, ssa_7 vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000*/) /* succs: block_7 */ } block block_7: /* preds: block_5 block_6 */ vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18 vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4 vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19 /* succs: block_1 */ } Unfortunatly GCM could move the addition out of the if for us (making this patch unrequired) but we still cannot enable the GCM pass without regressions. This unrolls a loop in Rise of The Tomb Raider. vkpipeline-db results (VEGA): Totals from affected shaders: SGPRS: 88 -> 96 (9.09 %) VGPRS: 56 -> 52 (-7.14 %) Spilled SGPRs: 0 -> 0 (0.00 %) Spilled VGPRs: 0 -> 0 (0.00 %) Private memory VGPRs: 0 -> 0 (0.00 %) Scratch size: 0 -> 0 (0.00 %) dwords per thread Code Size: 2168 -> 4560 (110.33 %) bytes LDS: 0 -> 0 (0.00 %) blocks Max Waves: 4 -> 4 (0.00 %) Wait states: 0 -> 0 (0.00 %) Reviewed-by: Thomas Helland <thomashelland90@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32211 |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.