Bug 93599

Summary: Strange green flashes with "Metro: Last Light Redux" + "Metro 2033 Redux" with Intel Mesa driver
Product: Mesa Reporter: Darius Spitznagel <d.spitznagel>
Component: Drivers/DRI/i965Assignee: Francisco Jerez <currojerez>
Status: RESOLVED DUPLICATE QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: currojerez, idr
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: gen7_stall_before_ro_invalidation.patch
gen7_stall_before_ro_invalidation-2.patch
attachment-13944-0.html

Description Darius Spitznagel 2016-01-05 22:15:17 UTC
Hello devs,

today I tested "Metro: Last Light Redux" and "Metro 2033 Redux" with current mesa git.
When I play that games with kernel lower 4.2.x (4.1.15 on my side) there are no green flashes. Rendering looks OK.
But with a higher kernel than 4.1.x (4.3.3 on my side) there are strange green flashes appearing (randomly I think) while rendering.

This problem is reproducible.

I have made an apitrace (from Metro: LL Redux) which you can download here...
http://www.goodbytez.de/mesa/metro.trace.tar.xz

As already stated above, replaying this with kernel 4.1.15 is OK and starting with kernel 4.2.x there are these green flashes. There are also some black glitches you can see best at the end of the replay before the steel gate opens.

The apitrace replay with kernel 4.3.3 shows following problems that catched my eyes. Metro: LL compiles lots of tiny shaders...

At the very beginning...

<<<
1500: message: major shader compiler error 4: 0:16(1): error: #extension directive is not allowed in the middle of a shader
1500 @0 glCompileShader(shader = 440)
1500: warning: compilation failed
1500: warning: 0:16(1): error: #extension directive is not allowed in the middle of a shader
>>>

Later...

<<<
334307: message: shader compiler issue 17597: VS vec4 shader: 46 inst, 0 loops, 204 cycles, compacted 736 to 624 bytes.
334348 @0 glLinkProgram(program = 6704)
334348: warning: link failed
334348: warning: error: linking with uncompiled shader
>>>

More later...

<<<
342445: message: api performance issue 17829: Exceeded state cache size limit.  Clearing the set of compiled programs, which will trigger recompiles
>>>

From this point on the replay starts recompiling vertex and fragment shaders...

<<<
342484: message: api performance issue 36: Recompiling vertex shader for program 5429
342484: message: api performance issue 37:   Didn't find previous compile in the shader cache for debug
342484: message: shader compiler issue 17831: FS SIMD8 shader: 156 inst, 0 loops, 738 cycles, 0:0 spills:fills, Promoted 9 constants, compacted 2496 to 1872 bytes.
342484: message: shader compiler issue 17832: FS SIMD16 shader: 157 inst, 0 loops, 1020 cycles, 0:0 spills:fills, Promoted 9 constants, compacted 2512 to 1888 bytes.
342484: message: api performance issue 40: Recompiling fragment shader for program 5429
342484: message: api performance issue 41:   Didn't find previous compile in the shader cache for debug
>>>

The above problems appear all on the loading screen.

When rendering starts you can see these ones...

<<<
494856: message: api performance issue 18266: GTT mapping a busy miptree BO stalled and took 11.057 ms.
523046: message: api performance issue 18266: GTT mapping a busy miptree BO stalled and took 0.754 ms.
579096: message: api performance issue 18266: GTT mapping a busy miptree BO stalled and took 2.064 ms.
606992: message: api performance issue 18266: GTT mapping a busy miptree BO stalled and took 1.608 ms.
>>>

And one or two...

<<<
7450104: message: api performance issue 18474: CPU mapping a busy MapBufferRange BO stalled and took 0.889 ms.
>>>

The replay with kernel 4.1.15 shows much the same problems, but when it comes to rendering there are NO green flashes.

I would be very happy if someone could take a look into this. Having an older kernel installed for 2 games doesn't make sense.

My System is Debian 8.2 x64
libdrm 2.4.66 from git master
xserver-xorg-video-intel 2.99.917+git20151217
xserver-xorg-core 1.16.4

Mesa:
glxinfo | grep OpenGL
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) Haswell Desktop 
OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.2.0-devel (git-cf84977)
OpenGL core profile shading language version string: 3.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 11.2.0-devel
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 11.2.0-devel
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
OpenGL ES profile extensions:

Steam start options:
MESA_GL_VERSION_OVERRIDE=4.0 MESA_GLSL_VERSION_OVERRIDE=400 %command%
Comment 1 Darius Spitznagel 2016-01-05 22:39:31 UTC
I forgot to mention that my CPU/GPU is an Intel(R) Core(TM) i7-4770R CPU, Iris Pro

Kind regards
Darius
Comment 2 Ian Romanick 2016-01-06 17:24:51 UTC
(In reply to Darius Spitznagel from comment #0)
> <<<
> 1500: message: major shader compiler error 4: 0:16(1): error: #extension
> directive is not allowed in the middle of a shader
> 1500 @0 glCompileShader(shader = 440)
> 1500: warning: compilation failed
> 1500: warning: 0:16(1): error: #extension directive is not allowed in the
> middle of a shader
> >>>

Oh bother. :(  This is an application bug, and it can be worked around by setting the environment variable allow_glsl_extension_directive_midshader=true.  You can also set it in your driconf.

> Later...
> 
> <<<
> 334307: message: shader compiler issue 17597: VS vec4 shader: 46 inst, 0
> loops, 204 cycles, compacted 736 to 624 bytes.
> 334348 @0 glLinkProgram(program = 6704)
> 334348: warning: link failed
> 334348: warning: error: linking with uncompiled shader
> >>>

And this is surely caused by the previous message.  The shader that generated the '#extension directives..." message didn't compile, and this error is triggered when the application tries to link the failed shader.  The work-around for the previous problem should eliminate this message.

Once this is resolved, if the green flashes persist, can you see if you can reproduce the problem on upstream kernels?  If you can, is there any chance you could bisect?

> More later...
> 
> <<<
> 342445: message: api performance issue 17829: Exceeded state cache size
> limit.  Clearing the set of compiled programs, which will trigger recompiles
> >>>

This and the other messages are just performance hints for developers (application and driver).  You can ignore these.
Comment 3 Darius Spitznagel 2016-01-06 23:19:49 UTC
(In reply to Ian Romanick from comment #2)
Hello Ian,
you are right. Exporting allow_glsl_extension_directive_midshader=true resolved the first two problems/errors, but it didn't solve the green flashes nor the black glitches.

I took your advice and installed newest kernel as of today (it's 4.4-rc8), but this also didn't solve it.

In the meantime I compiled Mesa 11.1.0 and tried both games on kernel 4.3.3 and 4.4-rc8 again...
> No green flashes nor black glitches > AHA!

OK, I have now a point where I can start to bisect.
I will do what I can to find the bad commit(s).

Thank you so far.
Comment 4 Darius Spitznagel 2016-01-07 03:07:16 UTC
228d5a3f75086e92aaa01619a55d6c8ac7841e0e is the first bad commit
commit 228d5a3f75086e92aaa01619a55d6c8ac7841e0e
Author: Francisco Jerez <currojerez@riseup.net>
Date:   Tue Dec 8 18:53:57 2015 +0200

    i965: Hook up L3 partitioning state atom.
    
    Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
    Reviewed-by: Kristian Høgsberg  <krh@bitplanet.net>

:040000 040000 529ab4c31e8341f54ab57b6b0ae13617cb6af42e e85f211790f17b9b3285c304e144abaf7cb44092 M	src

Hope this helps!
Comment 5 Darius Spitznagel 2016-01-11 21:40:47 UTC
Hello Ian,

I dug a little bit more into this and found at the end of file src/mesa/drivers/dri/i965/gen7_l3_state.c this hack...

/**
 * Hack to restore the default L3 configuration.
 *
 * This will be called at the end of every batch in order to reset the L3
 * configuration to the default values for the time being until the kernel is
 * fixed.  Until kernel commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b
 * (included in v4.1) we would set the MI_RESTORE_INHIBIT bit when submitting
 * batch buffers for the default context used by the DDX, which meant that any
 * context state changed by the GL would leak into the DDX, the assumption
 * being that the DDX would initialize any state it cares about manually.  The
 * DDX is however not careful enough to program an L3 configuration
 * explicitly, and it makes assumptions about it (URB size) which won't hold
 * and cause it to misrender if we let our L3 set-up to leak into the DDX.
 *
 * Since v4.1 of the Linux kernel the default context is saved and restored
 * normally, so it's far less likely for our L3 programming to interfere with
 * other contexts -- In fact restoring the default L3 configuration at the end
 * of the batch will be redundant most of the time.  A kind of state leak is
 * still possible though if the context making assumptions about L3 state is
 * created immediately after our context was active (e.g. without the DDX
 * default context being scheduled in between) because at present the DRM
 * doesn't fully initialize the contents of newly created contexts and instead
 * sets the MI_RESTORE_INHIBIT flag causing it to inherit the state from the
 * last active context.
 *
 * It's possible to realize such a scenario if, say, an X server (or a GL
 * application using an outdated non-L3-aware Mesa version) is started while
 * another GL application is running and happens to have modified the L3
 * configuration, or if no X server is running at all and a GL application
 * using a non-L3-aware Mesa version is started after another GL application
 * ran and modified the L3 configuration -- The latter situation can actually
 * be reproduced easily on IVB in our CI system.
 */
void
gen7_restore_default_l3_config(struct brw_context *brw)
{
   const struct brw_device_info *devinfo = brw->intelScreen->devinfo;
   /* For efficiency assume that the first entry of the array matches the
    * default configuration.
    */
   const struct brw_l3_config *const cfg = get_l3_configs(devinfo);
   assert(cfg == get_l3_config(devinfo,
                               get_default_l3_weights(devinfo, false, false)));

   if (cfg != brw->l3.config && brw->can_do_pipelined_register_writes) {
      setup_l3_config(brw, cfg);
      update_urb_size(brw, cfg);
      brw->l3.config = cfg;
   }
}

... when I change this to...

void
gen7_restore_default_l3_config(struct brw_context *brw) {}

...both games render correct with kernel 4.3.3 (tested).

But I am not sure if I disabled "L3 partitioning" at whole with this one or the hack which was/is my intension.
If I am right, maybe the autor (Francisco Jerez) could check this for correctness?
Otherwise I'm sorry.
Comment 6 Ian Romanick 2016-01-11 22:40:29 UTC
Reassigning to Curro, since he wrote the patch. :)
Comment 7 Francisco Jerez 2016-01-13 04:14:31 UTC
Created attachment 120998 [details] [review]
gen7_stall_before_ro_invalidation.patch

Does the attached patch fix the corruption for you?

Thanks.
Comment 8 Darius Spitznagel 2016-01-13 19:32:36 UTC
(In reply to Francisco Jerez from comment #7)
> Created attachment 120998 [details] [review] [review]
> gen7_stall_before_ro_invalidation.patch
> 
> Does the attached patch fix the corruption for you?
> 
> Thanks.

Yes, it does! Please push it to master before 11.2.0 final.
Thank you very much.

I would also appreciate it when both games would be added to Mesas included drirc like Dead Island, Second Life, etc. (see Comment 2 from Ian > allow_glsl_extension_directive_midshader=true).
Comment 9 Francisco Jerez 2016-01-14 03:33:48 UTC
Created attachment 121012 [details] [review]
gen7_stall_before_ro_invalidation-2.patch

Thanks for testing.  I'm attaching a (marginally) more optimal fix which avoids a redundant CS stall and HDC flush from the next pipe control, can you confirm whether it still gets rid of the corruption?

Thanks.
Comment 10 Darius Spitznagel 2016-01-14 14:16:00 UTC
(In reply to Francisco Jerez from comment #9)
> Created attachment 121012 [details] [review] [review]
> gen7_stall_before_ro_invalidation-2.patch
> 
> Thanks for testing.  I'm attaching a (marginally) more optimal fix which
> avoids a redundant CS stall and HDC flush from the next pipe control, can
> you confirm whether it still gets rid of the corruption?
> 
> Thanks.

This optimized patch also works - there is no corruption!

Thanks again.
Comment 11 Francisco Jerez 2016-01-15 02:44:08 UTC
(In reply to Darius Spitznagel from comment #10)
> (In reply to Francisco Jerez from comment #9)
> > Created attachment 121012 [details] [review] [review] [review]
> > gen7_stall_before_ro_invalidation-2.patch
> > 
> > Thanks for testing.  I'm attaching a (marginally) more optimal fix which
> > avoids a redundant CS stall and HDC flush from the next pipe control, can
> > you confirm whether it still gets rid of the corruption?
> > 
> > Thanks.
> 
> This optimized patch also works - there is no corruption!
> 
> Thanks again.

Cool, I just added your tested-by tag and sent the fix for review to the mesa-dev mailing list:

http://lists.freedesktop.org/archives/mesa-dev/2016-January/104968.html
Comment 12 Mark Janes 2016-01-15 09:39:42 UTC

*** This bug has been marked as a duplicate of bug 93540 ***
Comment 13 Darius Spitznagel 2016-01-15 13:58:00 UTC
Created attachment 121065 [details]
attachment-13944-0.html

On 01/15/2016 03:44 AM, bugzilla-daemon@freedesktop.org wrote:
>
> *Comment # 11 <https://bugs.freedesktop.org/show_bug.cgi?id=93599#c11> 
> on bug 93599 <https://bugs.freedesktop.org/show_bug.cgi?id=93599> from 
> Francisco Jerez <mailto:currojerez@riseup.net> *
> (In reply to Darius Spitznagel fromcomment #10 <show_bug.cgi?id=93599#c10>)
> > (In reply to Francisco Jerez fromcomment #9 <show_bug.cgi?id=93599#c9>) > > Created attachment 121012 [details] [review] 
> <attachment.cgi?id=121012> [details] 
> <attachment.cgi?id=121012&action=edit> [review] 
> <page.cgi?id=splinter.html&bug=93599&attachment=121012> [review] 
> [review] > > gen7_stall_before_ro_invalidation-2.patch > > > > Thanks 
> for testing. I'm attaching a (marginally) more optimal fix which > > 
> avoids a redundant CS stall and HDC flush from the next pipe control, 
> can > > you confirm whether it still gets rid of the corruption? > > > 
> > Thanks. > > This optimized patch also works - there is no 
> corruption! > > Thanks again.
>
> Cool, I just added your tested-by tag and sent the fix for review to the
> mesa-dev mailing list:
>
> http://lists.freedesktop.org/archives/mesa-dev/2016-January/104968.html
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the bug.
>
Hello Francisco,
I think you forgot the rename PIPE_CONTROL_DATA_CACHE_INVALIDATE also in 
"src/mesa/drivers/dri/i965/brw_misc_state.c".
Comment 14 Francisco Jerez 2016-01-15 19:30:33 UTC
(In reply to Darius Spitznagel from comment #13)
> Created attachment 121065 [details]
> attachment-13944-0.html
> 
> On 01/15/2016 03:44 AM, bugzilla-daemon@freedesktop.org wrote:
> >
> > *Comment # 11 <https://bugs.freedesktop.org/show_bug.cgi?id=93599#c11> 
> > on bug 93599 <https://bugs.freedesktop.org/show_bug.cgi?id=93599> from 
> > Francisco Jerez <mailto:currojerez@riseup.net> *
> > (In reply to Darius Spitznagel fromcomment #10 <show_bug.cgi?id=93599#c10>)
> > > (In reply to Francisco Jerez fromcomment #9 <show_bug.cgi?id=93599#c9>) > > Created attachment 121012 [details] [review] [review] 
> > <attachment.cgi?id=121012> [details] 
> > <attachment.cgi?id=121012&action=edit> [review] 
> > <page.cgi?id=splinter.html&bug=93599&attachment=121012> [review] 
> > [review] > > gen7_stall_before_ro_invalidation-2.patch > > > > Thanks 
> > for testing. I'm attaching a (marginally) more optimal fix which > > 
> > avoids a redundant CS stall and HDC flush from the next pipe control, 
> > can > > you confirm whether it still gets rid of the corruption? > > > 
> > > Thanks. > > This optimized patch also works - there is no 
> > corruption! > > Thanks again.
> >
> > Cool, I just added your tested-by tag and sent the fix for review to the
> > mesa-dev mailing list:
> >
> > http://lists.freedesktop.org/archives/mesa-dev/2016-January/104968.html
> > ------------------------------------------------------------------------
> > You are receiving this mail because:
> >
> >   * You reported the bug.
> >
> Hello Francisco,
> I think you forgot the rename PIPE_CONTROL_DATA_CACHE_INVALIDATE also in 
> "src/mesa/drivers/dri/i965/brw_misc_state.c".

I committed the patch that introduces that invalidation yesterday, after I had written and sent for review the series fixing for your problem, so it wasn't there at that point -- But thanks for reminding me to update it ;)
Comment 15 Darius Spitznagel 2016-02-04 12:57:38 UTC
(In reply to Francisco Jerez from comment #14)
> I committed the patch that introduces that invalidation yesterday, after I
> had written and sent for review the series fixing for your problem, so it
> wasn't there at that point -- But thanks for reminding me to update it ;)
Yes you are right.

I wonder why your patches are still not in master;)
Comment 16 Francisco Jerez 2016-02-04 21:53:37 UTC
(In reply to Darius Spitznagel from comment #15)
> (In reply to Francisco Jerez from comment #14)
> > I committed the patch that introduces that invalidation yesterday, after I
> > had written and sent for review the series fixing for your problem, so it
> > wasn't there at that point -- But thanks for reminding me to update it ;)
> Yes you are right.
> 
> I wonder why your patches are still not in master;)

The patches are still missing a Reviewed-by, feel free to send a ping on the mailing list :)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.