Created attachment 139956 [details]
Test case that exhibits the issue

If a value loaded from an SSBO is used multiple times, the load_ssbo gets
copy propagated to each use. SSBO loads are expensive, so, while this may be
technically correct, it is also bad. Arguments can be made that the test
case is not correct, but that is an orthogonal issue.

The core of the attached test case is:

    layout(binding = 0) buffer bufblock {
        int value;
    };

    ...

    int f;

    /* This is an open-coded atomicAdd. */
    do {
        f = value;
    } while (f != atomicCompSwap(value, f, f + 4));

For this code, the resulting NIR is (note the four load_ssbo operations):

    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_4 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_5 = iadd ssa_0, ssa_4
        vec1 32 ssa_6 = intrinsic load_ssbo (ssa_5, ssa_0) () ()
        vec1 32 ssa_7 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_8 = iadd ssa_0, ssa_7
        vec1 32 ssa_9 = intrinsic load_ssbo (ssa_8, ssa_0) () ()
        vec1 32 ssa_10 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_11 = iadd ssa_0, ssa_10
        vec1 32 ssa_12 = intrinsic load_ssbo (ssa_11, ssa_0) () ()
        vec1 32 ssa_13 = iadd ssa_12, ssa_1
        vec1 32 ssa_14 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_15 = iadd ssa_0, ssa_14
        vec1 32 ssa_16 = intrinsic ssbo_atomic_comp_swap (ssa_15, ssa_0, ssa_9, ssa_13) () ()
        vec1 32 ssa_17 = load_const (0x00000001 /* 0.000000 */)
        vec1 32 ssa_18 = iadd ssa_0, ssa_17
        vec1 32 ssa_19 = intrinsic load_ssbo (ssa_18, ssa_0) () ()
        vec1 32 ssa_20 = ieq ssa_19, ssa_16
        /* succs: block_2 block_3 */
        if ssa_20 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }

The multiple reads get a mishmash of values (written by other threads), so
the loop may never terminate. Annotating 'value' with volatile helps, but
the generated code is still wrong. There should definitely *NOT* be a
load_ssbo after the ssbo_atomic_comp_swap!

    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_4 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_5 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_6 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_7 = iadd ssa_6, ssa_1
        vec1 32 ssa_8 = intrinsic ssbo_atomic_comp_swap (ssa_2, ssa_0, ssa_5, ssa_7) () ()
        vec1 32 ssa_9 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
        vec1 32 ssa_10 = ieq ssa_9, ssa_8
        /* succs: block_2 block_3 */
        if ssa_10 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }

Completely rewriting the loop as

    int f = value;
    int a;

    do {
        a = f;
    } while ((f = atomicCompSwap(value, f, f + 4)) != a);

produces the expected result, a single load_ssbo hoisted before the loop:

    vec1 32 ssa_4 = intrinsic load_ssbo (ssa_2, ssa_0) () ()
    /* succs: block_1 */
    loop {
        block block_1:
        /* preds: block_0 block_4 */
        vec1 32 ssa_5 = phi block_0: ssa_4, block_4: ssa_7
        vec1 32 ssa_6 = iadd ssa_5, ssa_1
        vec1 32 ssa_7 = intrinsic ssbo_atomic_comp_swap (ssa_2, ssa_0, ssa_5, ssa_6) () ()
        vec1 32 ssa_8 = ieq ssa_7, ssa_5
        /* succs: block_2 block_3 */
        if ssa_8 {
            block block_2:
            /* preds: block_1 */
            break
            /* succs: block_5 */
        } else {
            block block_3:
            /* preds: block_1 */
            /* succs: block_4 */
        }
        block block_4:
        /* preds: block_3 */
        /* succs: block_1 */
    }
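For reference, ignoring its return value, the open-coded loop is
semantically just the built-in atomic (a minimal sketch; presumably the test
open-codes it deliberately to exercise the compare-swap path):

    /* GLSL built-in with the same net effect as the loop above. */
    atomicAdd(value, 4);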
Upon further inspection, it turns out this happens before NIR even sees it.
Are you sure that's illegal? I'd imagine that you'd need that variable to be marked as coherent, or to introduce a compiler barrier somehow. One could also argue that an atomic should introduce such a barrier explicitly -- I think in C they tend to act as compiler barriers.
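For illustration, marking the member coherent would look like this (a sketch
using the standard GLSL memory qualifier, with the binding layout copied
from the test case above; the attached test does not currently do this):

    layout(binding = 0) buffer bufblock {
        coherent int value;  /* reads/writes become visible across invocations */
    };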
(In reply to Ilia Mirkin from comment #2)
> Are you sure that's illegal? I'd imagine that you'd need that variable to be
> marked as coherent, or to introduce a compiler barrier somehow.

Legality isn't really relevant. The shader source contains one instance of
a very, very expensive operation, and we decide to do it four times. That's
unacceptable.
(In reply to Ian Romanick from comment #3)
> (In reply to Ilia Mirkin from comment #2)
> > Are you sure that's illegal? I'd imagine that you'd need that variable to be
> > marked as coherent, or to introduce a compiler barrier somehow.
>
> Legality isn't really relevant. The shader source contains one instance of
> a very, very expensive operation, and we decide to do it four times. That's
> unacceptable.

Oh wait. I totally misread. Copy propagate. And it does so ACROSS an atomic,
no less! That's definitely broken.
I'd be a fan of just making GLSL IR copy propagation bail on anything that
isn't a local, global, or uniform. We could try to make it smart enough to
handle barriers correctly with SSBOs and shared variables, but that's very
hard to get right.
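In GLSL terms, the storage classes in question would look roughly like this
(an illustrative sketch of the proposed distinction; the actual check would
be on the variable's storage mode inside the GLSL IR pass):

    uniform int u;           /* uniform: immutable for the draw, safe to propagate */
    int g;                   /* global temporary: private to the invocation, safe */
    buffer blk { int s; };   /* SSBO: other invocations can write it, must bail */
    shared int sh;           /* compute-only shared: same problem, must bail */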
Also, please turn your test case there into a piglit test. Extra points if it's a CTS test so we force everyone else to be correct too. :-)
(In reply to Jason Ekstrand from comment #6)
> Also, please turn your test case there into a piglit test. Extra points if
> it's a CTS test so we force everyone else to be correct too. :-)

I'm not sure if you looked at it, but it is a piglit shader_runner test.
The current failure mode is many (recoverable) GPU hangs.
(In reply to Ian Romanick from comment #7)
> (In reply to Jason Ekstrand from comment #6)
> > Also, please turn your test case there into a piglit test. Extra points if
> > it's a CTS test so we force everyone else to be correct too. :-)
>
> I'm not sure if you looked at it, but it is a piglit shader_runner test.
> The current failure mode is many (recoverable) GPU hangs.

Which I should point out are caused by infinite loops in the shader... not
some other problem. :)
(In reply to Ian Romanick from comment #8)
> (In reply to Ian Romanick from comment #7)
> > (In reply to Jason Ekstrand from comment #6)
> > > Also, please turn your test case there into a piglit test. Extra points if
> > > it's a CTS test so we force everyone else to be correct too. :-)
> >
> > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > The current failure mode is many (recoverable) GPU hangs.

I figured. What other kind of simple shader test is there? That was more
of a "Once you fix it, please commit to piglit".

> Which I should point out are caused by infinite loops in the shader... not
> some other problem. :)

Those are a very good source of hangs. They're actually debuggable. :)
(In reply to Jason Ekstrand from comment #9)
> > > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > > The current failure mode is many (recoverable) GPU hangs.
>
> I figured. What other kind of simple shader test is there? That was more
> of a "Once you fix it, please commit to piglit".

Please don't add piglit tests which are expected to hang the GPU, or at
least don't set them up to be run by default. Not all drivers can reliably
recover from GPU hangs.
Even with the fixes to the GLSL copy propagation (and other passes) logic,
the shader was still hanging. After a debugging session with Jason, he
figured out that the helper invocations were not making progress and were
stuck in an infinite loop. While the reads of 'value' return correct data,
the result of atomicCompSwap is undefined for helper invocations, so the
comparison always failed.

The alternative version of the test in the report (the one that works)
avoids this issue, since the return value of atomicCompSwap seems to always
be the same (although not something we should rely on, I guess), allowing
the loop to end.

Returning early based on gl_HelperInvocation makes things work with the
original loop, but depends on a newer GL version.
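For the record, the early-return workaround looks roughly like this (a
sketch only; gl_HelperInvocation requires GLSL 4.50, which is the version
dependency mentioned above):

    void main()
    {
        /* Atomics return undefined values in helper invocations, so the
         * compare-and-swap loop below might never terminate for them. */
        if (gl_HelperInvocation)
            return;

        int f;
        do {
            f = value;
        } while (f != atomicCompSwap(value, f, f + 4));

        ...
    }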
(In reply to Michel Dänzer from comment #10)
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

IMHO those drivers should be fixed.
(In reply to almos from comment #12)
> (In reply to Michel Dänzer from comment #10)
> > Please don't add piglit tests which are expected to hang the GPU, or at
> > least don't set them up to be run by default. Not all drivers can reliably
> > recover from GPU hangs.
>
> IMHO those drivers should be fixed.

Hang recovery is kernel functionality, not Mesa's. There can be Mesa drivers
for HW with experimental kernel drivers, which don't have reliable hang
recovery yet (and the HW could be buggy too).

It would be interesting to know which of the current (kernel) drivers have
known non-recoverable GPU hangs, though.
(In reply to Eero Tamminen from comment #13)
> Hang recovery is kernel functionality, not Mesa's. There can be Mesa drivers
> for HW with experimental kernel drivers, which don't have reliable hang
> recovery yet (and the HW could be buggy too).
>
> It would be interesting to know which of the current (kernel) drivers have
> known non-recoverable GPU hangs, though.

A few examples: radeon, amdgpu, nouveau.
(In reply to Michel Dänzer from comment #10)
> (In reply to Jason Ekstrand from comment #9)
> > > > I'm not sure if you looked at it, but it is a piglit shader_runner test.
> > > > The current failure mode is many (recoverable) GPU hangs.
> >
> > I figured. What other kind of simple shader test is there? That was more
> > of a "Once you fix it, please commit to piglit".
>
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

The intention is not to add a test that hangs. The two required fixes are
to the test and to GLSL IR. Once those are made, it shouldn't hang for
anyone.

As far as general policy goes, no, we don't want to have GPU hangs. Maybe
we could even go so far as to say that if you have a test which has a high
probability of hanging the GPU, it shouldn't be added to the test list by
default. However, there are many kinds of tests which might hang the GPU if
the driver has a bug, through no fault of the test. As someone who works on
a very hangy GPU, I'm keenly aware of this. :-) That doesn't mean we
shouldn't land those tests; it means we should fix those drivers.
(In reply to Caio Marcelo de Oliveira Filho from comment #11)
> Even with the fixes to the GLSL copy propagation (and other passes) logic,
> the shader was still hanging. After a debugging session with Jason, he
> figured out that the helper invocations were not making progress and were
> stuck in an infinite loop.

As requested, I created https://bugs.freedesktop.org/show_bug.cgi?id=106902
for this particular issue with helper invocations.
I have sent fixes to the mesa-dev list:

https://patchwork.freedesktop.org/series/44661/

And an updated version of the test case (that avoids the helper pixels
problem) to the piglit list:

https://patchwork.freedesktop.org/patch/229132/
This bug should be fixed by the commits listed below.

(In reply to Michel Dänzer from comment #10)
> Please don't add piglit tests which are expected to hang the GPU, or at
> least don't set them up to be run by default. Not all drivers can reliably
> recover from GPU hangs.

The updated test does not cause GPU hangs with these patches applied. Are
you ok with the test landing in piglit master?

commit 37bd9ccd21b860d2b5ffea7e1f472ec83b68b43b (HEAD -> bug-106774, origin/master, origin/HEAD)
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue Jun 5 16:02:25 2018 -0700

    glsl: Don't copy propagate elements from SSBO or shared variables either

    Since SSBOs can be written by a different GPU thread, copy propagating
    a read can cause the value to magically change. SSBO reads are also
    very expensive, so doing it twice will be slower.

    The same shader was helped by this patch and the previous.

    Haswell, Broadwell, and Skylake had similar results. (Skylake shown)
    total instructions in shared programs: 14399119 -> 14399113 (<.01%)
    instructions in affected programs: 683 -> 677 (-0.88%)
    helped: 1
    HURT: 0

    total cycles in shared programs: 532973113 -> 532971865 (<.01%)
    cycles in affected programs: 524666 -> 523418 (-0.24%)
    helped: 1
    HURT: 0

    Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
    Cc: mesa-stable@lists.freedesktop.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106774

commit 461a5c899c08064467abb635536381a5a5659280
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue Jun 5 15:04:24 2018 -0700

    glsl: Don't copy propagate from SSBO or shared variables either

    Since SSBOs can be written by other GPU threads, copy propagating a
    read can cause the value to magically change. SSBO reads are also very
    expensive, so doing it twice will be slower.

    Haswell, Broadwell, and Skylake had similar results. (Skylake shown)
    total instructions in shared programs: 14399120 -> 14399119 (<.01%)
    instructions in affected programs: 684 -> 683 (-0.15%)
    helped: 1
    HURT: 0

    total cycles in shared programs: 532978931 -> 532973113 (<.01%)
    cycles in affected programs: 530484 -> 524666 (-1.10%)
    helped: 1
    HURT: 0

    Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
    Cc: mesa-stable@lists.freedesktop.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106774
(In reply to Ian Romanick from comment #18)
> The updated test does not cause GPU hangs with these patches applied. Are
> you ok with the test landing in piglit master?

Sure, thanks.
FWIW, this sounds a lot like the problem we tried to address in NIR with a load-combine pass some time ago here: https://lists.freedesktop.org/archives/mesa-dev/2016-April/112925.html
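In source terms, the effect such a pass aims for is something like this
(a GLSL-level illustration of the intent only; the pass itself would
operate on NIR load/store intrinsics):

    /* Before: each use of `value` turns into its own load_ssbo. */
    bool ok = (value > 0) && (value < 10);

    /* After combining (only legal when no store, atomic, or barrier that
     * could touch the buffer intervenes between the two loads): */
    int tmp = value;
    bool ok2 = (tmp > 0) && (tmp < 10);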