Bug 32211

Summary: [GLSL] lower_jumps with continue-statements in for-loops prevents loop unrolling
Product: Mesa Reporter: Shuang He <shuang.he>
Component: glsl-compilerAssignee: mesa-dev
Status: RESOLVED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: low CC: christophe.prigent, danylo.piliaiev, idr, kenneth, stereotype441
Version: git   
Hardware: Other   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: piglit test case
Removing unnecessary continue

Description Shuang He 2010-12-07 21:18:17 UTC
System Environment:
--------------------------
Libdrm:        (master)2.4.22-16-g51b895041c65f7ec9ecda48e79279dde29258b07
Mesa:        (master)1eb7a81f2e43842acd59929ce65db2142b69134d
Xserver:       
(master)xorg-server-1.9.0-330-g4e0f8f666e61390206c42ad2087477a912525bc7
Xf86_video_intel:       
(master)2.13.901-5-g0bb135c40e5ac1bf7593ec1d68d2815cbf47aa25


Bug detailed description:
-------------------------
This bug hapens on Pineview. 
This issue happens to both OpenGL GLSL and OpenGL ES 2.0 GLSL

The reported error is:
Failed to link:
Couldn't flatten if statement

PIGLIT: {'result': 'fail' }

Following case could be used to reproduce the issue:
[require]
GLSL >= 1.00

[vertex shader]
attribute vec4 piglit_Position;

void main()
{
        gl_Position = piglit_Position;
}

[fragment shader]
void main()
{
        for (int i=0; i<8; i++)
        {
                if (i > 4) {
                        gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
                        continue;
                }
                gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
        }
}

[test]
draw rect -1 -1 2 2
relative probe rgb (0.03, 0.25) (0.0, 1.0, 0.0)
Comment 1 Shuang He 2010-12-07 21:20:01 UTC
Created attachment 40894 [details]
piglit test case
Comment 2 Ian Romanick 2010-12-10 20:00:17 UTC
This failure will happen for any driver that needs if-statements to be flattened.  Reassigning to glsl-compiler.
Comment 3 Ian Romanick 2010-12-13 16:25:41 UTC
I committed glsl-{vs,fs}-loop-continue.shader_test to reproduce this failure.  It appears that this failure still exists on Mesa master.
Comment 4 Kenneth Graunke 2010-12-14 01:10:11 UTC
It looks like the i915 driver isn't setting EmitNoCond or EmitNoBreak.  Perhaps we want to do that?
Comment 5 Ian Romanick 2011-07-25 12:45:52 UTC
This is actually a really, really bad bug in lower_jumps.

Early in compile time we convert

    for (int i=0; i<8; i++) {
        if (i > 4) {
             gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
             continue;
        }
        gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
    }

into

    int i = 0;
    for (;;) {
        if (!(i < 8))
            break;

        if (i > 4) {
            gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
            continue;
        }
        gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
        i++;
    }

Then lower_jumps comes along and converts it into:

    int i = 0;
    for (;;) {
        bool continue_flag = false;

        if (!(i < 8))
            break;

        if (i > 4) {
            gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
            continue_flag = true;
        }

        if (!continue_flag) {
            gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
            i++;
        }
    }

If we execute the continue-statement, we don't execute the loop increment code.
Comment 6 Ian Romanick 2011-07-25 12:52:13 UTC
It's not quite that bad.  It looks like we anticipated this sort of problem.  When a continue statement is emitted the loop-increment code is emitted (again) in place.  So we actually get

    int i = 0;
    for (;;) {
        if (!(i < 8))
            break;

        if (i > 4) {
            gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
            i++;
            continue;
        }
        gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
        i++;
    }

Then lower_jumps comes along and converts it into:

    int i = 0;
    for (;;) {
        bool continue_flag = false;

        if (!(i < 8))
            break;

        if (i > 4) {
            gl_FragColor = vec4(0.0, 1.0, 0.0, 0.0);
            i++;
            continue_flag = true;
        }

        if (!continue_flag) {
            gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
            i++;
        }
    }

Since there are multiple (identical) assignments to i, the loop doesn't get unrolled.
Comment 7 Kenneth Graunke 2011-08-11 00:17:44 UTC
Paul, did you by chance fix this when you were working on lower_jumps a while ago?
Comment 8 Paul Berry 2011-08-11 08:21:19 UTC
(In reply to comment #7)
> Paul, did you by chance fix this when you were working on lower_jumps a while
> ago?

No, I don't believe I did.  In fact, it seems to me that this bug has nothing to do with jump lowering.  As Ian mentioned in comment #6, the reason the loop isn't getting unrolled is because there are multiple assignments to i, and that was true both before and after jump lowering.

I suspect the only feasible ways to fix this bug are (a) make loop analysis much more sophisticated, and (b) make the IR richer so that it's not necessary to duplicate the loop increment.  Unfortunately, both of those options would be fairly significant undertakings.
Comment 9 Paul Berry 2012-01-16 17:42:29 UTC
The core issue here is that the ir_loop data structure is not sufficiently rich to keep track of which parts of a loop are part of the loop body and which parts are part of the loop increment.  Which means when the code is converted from AST to HIR, this:

for (initialization; condition; increment) {
  body;
}

is effectively converted to this:

initialization;
for (;;) {
  if (!condition) {
    break;
  }
  body;
  increment;
}

This creates a problem: if a continue statement were to appear inside the loop body, it would skip the increment, most likely resulting in an infinite loop.  To avoid that, there is a hack in the ast-to-hir code for the continue statement which automatically re-emits the loop increment just before emitting the continue.  This causes correct loop behavior, but as Ian previously mentioned in comment 6, defeats loop unrolling.

I think the right solution to this bug is to expand the IR so that ir_loop maintains the loop body separately from the increment.  That way we won't have to do the hack in the ast-to-hir for continue statements, and the loop will be unrollable.

Unfortunately, this is a large undertaking, since it involves modifying all lowering passes and optimizations that act on loops, to make sure they handle the loop increment properly.  It's far to risky to do before the Mesa 8.0 release.  I'm removing this bug from the list of Mesa 8.0 release blockers.  I'll make a note to myself to revisit this bug after the Mesa 8.0 release.
Comment 10 Timothy Arceri 2016-09-06 11:03:02 UTC
I'm assigning this to myself and will be taking a look at it as part of my work on a loop unrolling pass for NIR.
Comment 11 Timothy Arceri 2018-04-12 11:04:58 UTC
With NIR we end up with :

	loop {
		block block_1:
		/* preds: block_0 block_5 block_7 */
		vec1 32 ssa_8 = phi block_0: ssa_3, block_5: ssa_4, block_7: ssa_19
		vec1 32 ssa_9 = phi block_0: ssa_2, block_5: ssa_17, block_7: ssa_4
		vec1 32 ssa_10 = phi block_0: ssa_1, block_5: ssa_4, block_7: ssa_4
		vec1 32 ssa_11 = phi block_0: ssa_0, block_5: ssa_4, block_7: ssa_4
		vec1 32 ssa_12 = phi block_0: ssa_4, block_5: ssa_16, block_7: ssa_18
		vec4 32 ssa_13 = vec4 ssa_8, ssa_9, ssa_10, ssa_11
		vec1 32 ssa_14 = ige ssa_12, ssa_5
		/* succs: block_2 block_3 */
		if ssa_14 {
			block block_2:
			/* preds: block_1 */
			break
			/* succs: block_8 */
		} else {
			block block_3:
			/* preds: block_1 */
			/* succs: block_4 */
		}
		block block_4:
		/* preds: block_3 */
		vec1 32 ssa_15 = ilt ssa_6, ssa_12
		/* succs: block_5 block_6 */
		if ssa_15 {
			block block_5:
			/* preds: block_4 */
			vec1 32 ssa_16 = iadd ssa_12, ssa_7
			vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000 */)
			continue
			/* succs: block_1 */
		} else {
			block block_6:
			/* preds: block_4 */
			/* succs: block_7 */
		}
		block block_7:
		/* preds: block_6 */
		vec1 32 ssa_18 = iadd ssa_12, ssa_7
		vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000 */)
		/* succs: block_1 */
	}

So all we need to do is move everything after the if into the else block and remove the continue. Removing myself as assignee, this would probably be a good beginners task.
Comment 12 Danylo 2018-11-13 13:45:56 UTC
(In reply to Timothy Arceri from comment #11)
> 
> So all we need to do is move everything after the if into the else block and
> remove the continue. Removing myself as assignee, this would probably be a
> good beginners task.
Hi,

I've tried to do this and it works for me however it alone doesn't solve the problem.

Consider the resulting nir:

        loop {
                block block_1:
                /* preds: block_0 block_7 */
                vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20
                vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4
                vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4
                vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21
                vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22
                vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9
                vec1 32 ssa_14 = ige ssa_8, ssa_5
                /* succs: block_2 block_3 */
                if ssa_14 {
                        block block_2:
                        /* preds: block_1 */
                        break
                        /* succs: block_8 */
                } else {
                        block block_3:
                        /* preds: block_1 */
                        /* succs: block_4 */
                }
                block block_4:
                /* preds: block_3 */
                vec1 32 ssa_15 = ilt ssa_6, ssa_8
                /* succs: block_5 block_6 */
                if ssa_15 {
                        block block_5:
                        /* preds: block_4 */
                        vec1 32 ssa_16 = iadd ssa_8, ssa_7
                        vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000 */)
                        /* succs: block_7 */
                } else {
                        block block_6:
                        /* preds: block_4 */
                        vec1 32 ssa_18 = iadd ssa_8, ssa_7
                        vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000 */)
                        /* succs: block_7 */
                }
                block block_7:
                /* preds: block_5 block_6 */
                vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18
                vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4
                vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19
                /* succs: block_1 */
        }

Now in both "if" (block_5) and "else" (block_6) blocks there are identical expressions and no continue. 
However there is no optimization to pull these identical expressions out - CSE pass won't do this since it's a local CSE, only global CSE would help.
And there is no active global CSE pass, there is only Global Code Motion pass with Global Value Numbering and it is not enabled, enabling it optimizes the condition in question and in the end whole loop disappears as expected, however this pass doesn't look something we want since in other cases it hurts shaders and it is more than just global CSE.

Any opinions on this?
Comment 13 Timothy Arceri 2018-11-21 23:53:11 UTC
(In reply to Danylo from comment #12)
> (In reply to Timothy Arceri from comment #11)
> > 
> > So all we need to do is move everything after the if into the else block and
> > remove the continue. Removing myself as assignee, this would probably be a
> > good beginners task.
> Hi,
> 
> I've tried to do this and it works for me however it alone doesn't solve the
> problem.
> 
> Consider the resulting nir:
> 
>         loop {
>                 block block_1:
>                 /* preds: block_0 block_7 */
>                 vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20
>                 vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4
>                 vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4
>                 vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21
>                 vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22
>                 vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9
>                 vec1 32 ssa_14 = ige ssa_8, ssa_5
>                 /* succs: block_2 block_3 */
>                 if ssa_14 {
>                         block block_2:
>                         /* preds: block_1 */
>                         break
>                         /* succs: block_8 */
>                 } else {
>                         block block_3:
>                         /* preds: block_1 */
>                         /* succs: block_4 */
>                 }
>                 block block_4:
>                 /* preds: block_3 */
>                 vec1 32 ssa_15 = ilt ssa_6, ssa_8
>                 /* succs: block_5 block_6 */
>                 if ssa_15 {
>                         block block_5:
>                         /* preds: block_4 */
>                         vec1 32 ssa_16 = iadd ssa_8, ssa_7
>                         vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000
> */)
>                         /* succs: block_7 */
>                 } else {
>                         block block_6:
>                         /* preds: block_4 */
>                         vec1 32 ssa_18 = iadd ssa_8, ssa_7
>                         vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000
> */)
>                         /* succs: block_7 */
>                 }
>                 block block_7:
>                 /* preds: block_5 block_6 */
>                 vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18
>                 vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4
>                 vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19
>                 /* succs: block_1 */
>         }
> 
> Now in both "if" (block_5) and "else" (block_6) blocks there are identical
> expressions and no continue. 
> However there is no optimization to pull these identical expressions out -
> CSE pass won't do this since it's a local CSE, only global CSE would help.
> And there is no active global CSE pass, there is only Global Code Motion
> pass with Global Value Numbering and it is not enabled, enabling it
> optimizes the condition in question and in the end whole loop disappears as
> expected, however this pass doesn't look something we want since in other
> cases it hurts shaders and it is more than just global CSE.
> 
> Any opinions on this?

None of that should matter. If the continue if removed there should be nothing stopping the loop from unrolling, and if the loop is unrolled the both ifs should be able to be optimised away (assuming I'm reading the IR correctly). Is this not what you are seeing?
Comment 14 Danylo 2018-11-22 13:29:11 UTC
(In reply to Timothy Arceri from comment #13)
> 
> None of that should matter. If the continue if removed there should be
> nothing stopping the loop from unrolling, and if the loop is unrolled the
> both ifs should be able to be optimised away (assuming I'm reading the IR
> correctly). Is this not what you are seeing?

Unfortunately not, loop isn't unrolled.

To be on the same page the optimization I did is turning

   loop {
       ...
       if (cond) {
          do_work_1();
          continue;
       } else {
       }
       do_work_2();
    }

into:

    loop {
       ...
       if (cond) {
          do_work_1();
       } else {
          do_work_2();
       }
    }

So in our case it effectively produces:

       ...
       if (cond) {
          i++;
          do_work_1();
       } else {
          i++;
          do_work_2();
       }
       ...

Looks like in previous comment I forgot to say that both branches have 'i++'.

Loop with such condition couldn't be unrolled because 'compute_induction_information' could not find induction variable because 'i' is in a control flow

> /* If one of the sources is in a conditional or nested block then
>  * panic.
>  */
> if (src_var->in_control_flow)
>    break;

To make loop unrollable 'i++' should be outside of conditional block and there is no optimization that could pull it out as I wrote in previous comment.
Comment 15 Danylo 2018-11-22 13:32:06 UTC
Created attachment 142567 [details] [review]
Removing unnecessary continue

Optimization in question.
Comment 16 Timothy Arceri 2018-11-28 03:31:26 UTC
(In reply to Danylo from comment #15)
> Created attachment 142567 [details] [review] [review]
> Removing unnecessary continue
> 
> Optimization in question.

Thanks! I've tried a few variations of this patch and eventually settled on a v2 which I sent out with an update to loop analysis so that we can detect the duplicate additions.

https://patchwork.freedesktop.org/series/53128/
Comment 17 Timothy Arceri 2018-12-13 00:00:41 UTC
Should be fixed as of:

commit 9e6b39e1d521aa723749a47d958d58900bf25138 (HEAD -> master, origin/master, origin/HEAD)
Author: Timothy Arceri <tarceri@itsqueeze.com>
Date:   Mon Nov 26 12:05:00 2018 +1100

    nir: detect more induction variables
    
    This allows loop analysis to detect inductions variables that
    are incremented in both branches of an if rather than in a main
    loop block. For example:
    
       loop {
          block block_1:
          /* preds: block_0 block_7 */
          vec1 32 ssa_8 = phi block_0: ssa_4, block_7: ssa_20
          vec1 32 ssa_9 = phi block_0: ssa_0, block_7: ssa_4
          vec1 32 ssa_10 = phi block_0: ssa_1, block_7: ssa_4
          vec1 32 ssa_11 = phi block_0: ssa_2, block_7: ssa_21
          vec1 32 ssa_12 = phi block_0: ssa_3, block_7: ssa_22
          vec4 32 ssa_13 = vec4 ssa_12, ssa_11, ssa_10, ssa_9
          vec1 32 ssa_14 = ige ssa_8, ssa_5
          /* succs: block_2 block_3 */
          if ssa_14 {
             block block_2:
             /* preds: block_1 */
             break
             /* succs: block_8 */
          } else {
             block block_3:
             /* preds: block_1 */
             /* succs: block_4 */
          }
          block block_4:
          /* preds: block_3 */
          vec1 32 ssa_15 = ilt ssa_6, ssa_8
          /* succs: block_5 block_6 */
          if ssa_15 {
             block block_5:
             /* preds: block_4 */
             vec1 32 ssa_16 = iadd ssa_8, ssa_7
             vec1 32 ssa_17 = load_const (0x3f800000 /* 1.000000*/)
             /* succs: block_7 */
          } else {
             block block_6:
             /* preds: block_4 */
             vec1 32 ssa_18 = iadd ssa_8, ssa_7
             vec1 32 ssa_19 = load_const (0x3f800000 /* 1.000000*/)
             /* succs: block_7 */
          }
          block block_7:
          /* preds: block_5 block_6 */
          vec1 32 ssa_20 = phi block_5: ssa_16, block_6: ssa_18
          vec1 32 ssa_21 = phi block_5: ssa_17, block_6: ssa_4
          vec1 32 ssa_22 = phi block_5: ssa_4, block_6: ssa_19
          /* succs: block_1 */
       }
    
    Unfortunatly GCM could move the addition out of the if for us
    (making this patch unrequired) but we still cannot enable the GCM
    pass without regressions.
    
    This unrolls a loop in Rise of The Tomb Raider.
    
    vkpipeline-db results (VEGA):
    
    Totals from affected shaders:
    SGPRS: 88 -> 96 (9.09 %)
    VGPRS: 56 -> 52 (-7.14 %)
    Spilled SGPRs: 0 -> 0 (0.00 %)
    Spilled VGPRs: 0 -> 0 (0.00 %)
    Private memory VGPRs: 0 -> 0 (0.00 %)
    Scratch size: 0 -> 0 (0.00 %) dwords per thread
    Code Size: 2168 -> 4560 (110.33 %) bytes
    LDS: 0 -> 0 (0.00 %) blocks
    Max Waves: 4 -> 4 (0.00 %)
    Wait states: 0 -> 0 (0.00 %)
    
    Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32211

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.