Bug 102177

Summary: [SKL] ES31-CTS.core.sepshaderobjs.StateInteraction fails sporadically
Product: Mesa Reporter: Scott D Phillips <scott.d.phillips>
Component: Drivers/DRI/i965Assignee: Timothy Arceri <t_arceri>
Status: RESOLVED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium    
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:

Description Scott D Phillips 2017-08-12 00:52:45 UTC
Bisected to:

commit 3fe8d04a6d69fad98228be647ba9b250ea0e3a8b
Author: Timothy Arceri <timothy.arceri@collabora.com>
Date:   Mon Jan 9 16:13:27 2017 +1100

    mesa: don't always set _NEW_PROGRAM when linking

    We only need to set it when linking was successful and the program
    being linked is currently active.

    The programs_in_use mask is just used as a flag for now but in
    a future change we will use it to update the CurrentProgram array.

    V2: make sure to flush vertices before linking (suggested by Marek)

    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

Fails infrequently, seems to be ~0.6% of executions on my system. Doing what naively looks to me like a revert on master doesn't fix it, although lots has changed around that code since then.
Comment 1 Timothy Arceri 2017-08-12 02:02:37 UTC
Just to confirm. This is only happening on SKL?
Comment 2 Scott D Phillips 2017-08-12 02:31:17 UTC
I also saw a failure for the test in CI on Haswell, but haven't confirmed the bisect there, or checked any other platforms.
Comment 3 Kenneth Graunke 2017-08-14 04:10:31 UTC
This isn't Tim's fault.

The problem is that the test tries to re-link a program which is currently bound for rendering...but with a bogus shader, which causes a link failure.  The GL specification (I can't remember where exactly) says that in this case, the previous program remains bound for rendering and is not destroyed until it's unbound.

When re-linking, we immediately trash the program...but because linking fails, and we don't get a new one...we end up leaving the old freed program bound.  So when you go to draw...you hit a use-after-free...which may work or fail depending on the malloc calls in the meantime.

You can easily see this problem by using 'valgrind'.  This is a nasty issue, and I'm not exactly sure what to do about it.
Comment 4 Timothy Arceri 2017-08-14 06:06:44 UTC
I took a quick look at this after talking to Ken. There seems to be a bunch of scenarios but I'm wondering if in at least some cases we are marking things as dirty when they shouldn't be.

For example:

==30669== Invalid read of size 4
==30669==    at 0x9532B94: gen6_upload_push_constants (gen6_constant_state.c:85)
==30669==    by 0x9529250: check_and_emit_atom (brw_state_upload.c:456)
==30669==    by 0x9529250: brw_upload_pipeline_state (brw_state_upload.c:570)
==30669==    by 0x9529250: brw_upload_render_state (brw_state_upload.c:592)
==30669==    by 0x951BAAC: brw_try_draw_prims (brw_draw.c:770)
==30669==    by 0x951BAAC: brw_draw_prims (brw_draw.c:862)
==30669==    by 0x931F0BD: vbo_validated_drawrangeelements (vbo_exec_array.c:925)
==30669==    by 0x931F84E: vbo_exec_DrawElements (vbo_exec_array.c:1075)
==30669==    by 0xFC744A: glcts::StateInteractionCase::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==    by 0xE19E75: es31cts::TestCaseWrapper::iterate(tcu::TestCase*) (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==    by 0x27DD23F: tcu::TestSessionExecutor::iterateTestCase(tcu::TestCase*) (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==    by 0x27DC794: tcu::TestSessionExecutor::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==    by 0x27B1F3E: tcu::App::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==    by 0xE14313: main (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts)
==30669==  Address 0xa744580 is 48 bytes inside a block of size 64 free'd
==30669==    at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==30669==    by 0x944C14C: unsafe_free (ralloc.c:269)
==30669==    by 0x92B8EEF: _mesa_clear_shader_program_data (shaderobj.c:334)
==30669==    by 0x93A21BD: _mesa_glsl_link_shader (ir_to_mesa.cpp:3088)
==30669==    by 0x92B7158: link_program (shaderapi.c:1163)
==30669==    by 0x92B7158: link_program_error (shaderapi.c:1241)


Here DrawElements ends up calling gen6_upload_push_constants() (which accesses the now freed uniform storage) but I would have presumed constants would still be in the correct state.
Comment 5 Kenneth Graunke 2017-08-14 07:24:26 UTC
Nah, constant upload can happen for a variety of reasons.

I think we need to keep prog->UniformStorage around longer, rather than blowing it away in _mesa_clear_shader_program_data().
Comment 6 Mark Janes 2017-08-28 21:19:13 UTC
This bug has been around much longer than Tim's patch, and shouldn't block the release.
Comment 7 Neil Roberts 2017-11-02 18:46:50 UTC
I sent a potential patch for this to the mailing list here:

https://patchwork.freedesktop.org/patch/185974/

I also made a narrowed-down Piglit test which has a 98% failure rate here:

https://patchwork.freedesktop.org/patch/185975/
Comment 8 Timothy Arceri 2017-11-08 01:43:04 UTC
(In reply to Neil Roberts from comment #7)
> I sent a potential patch for this to the mailing list here:
> 
> https://patchwork.freedesktop.org/patch/185974/
> 
> I also made a narrowed-down Piglit test which has a 98% failure rate here:
> 
> https://patchwork.freedesktop.org/patch/185975/

Thanks for the test. When Ken pointed out the source of the problem I was sure I'd already done work to fix this as all the data used to be attached to gl_linked_shader which gets trashed every time we relink. I moved the data required by the backend to gl_shader_program_data but it seems I forgot to finish of the last step of creating a fresh instance of gl_shader_program_data when we link. I've sent some patches to fix this:

https://patchwork.freedesktop.org/series/33378/
Comment 9 Timothy Arceri 2017-11-10 04:58:15 UTC
Fixed by:

commit	6a72eba755fea15a0d97abb913a6315d9d32e274
mesa: rework how we free gl_shader_program_data

When I introduced gl_shader_program_data one of the intentions was to
fix a bug where a failed linking attempt freed data required by a
currently active program. However I seem to have failed to finish
hooking up the final steps required to have the data hang around.

Here we create a fresh instance of gl_shader_program_data every
time we link. gl_program has a reference to gl_shader_program_data
so it will be freed once the program is no longer active.

Cc: "17.2 17.3" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Neil Roberts <nroberts@igalia.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102177

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.