Bisected to: commit 3fe8d04a6d69fad98228be647ba9b250ea0e3a8b Author: Timothy Arceri <timothy.arceri@collabora.com> Date: Mon Jan 9 16:13:27 2017 +1100 mesa: don't always set _NEW_PROGRAM when linking We only need to set it when linking was successful and the program being linked is currently active. The programs_in_use mask is just used as a flag for now but in a future change we will use it to update the CurrentProgram array. V2: make sure to flush vertices before linking (suggested by Marek) Reviewed-by: Marek Olšák <marek.olsak@amd.com> Fails infrequently, seems to be ~0.6% of executions on my system. Doing what naively looks to me like a revert on master doesn't fix it, although lots has changed around that code since then.
Just to confirm. This is only happening on SKL?
I also saw a failure for the test in CI on Haswell, but haven't confirmed the bisect there, or checked any other platforms.
This isn't Tim's fault. The problem is that the test tries to re-link a program which is currently bound for rendering...but with a bogus shader, which causes a link failure. The GL specification (I can't remember where exactly) says that in this case, the previous program remains bound for rendering and is not destroyed until it's unbound. When re-linking, we immediately trash the program...but because linking fails, and we don't get a new one...we end up leaving the old freed program bound. So when you go to draw...you hit a use-after-free...which may work or fail depending on the malloc calls in the meantime. You can easily see this problem by using 'valgrind'. This is a nasty issue, and I'm not exactly sure what to do about it.
I took a quick look at this after talking to Ken. There seems to be a bunch of scenarios but I'm wondering if in at least some cases we are marking things as dirty when they shouldn't be. For example: ==30669== Invalid read of size 4 ==30669== at 0x9532B94: gen6_upload_push_constants (gen6_constant_state.c:85) ==30669== by 0x9529250: check_and_emit_atom (brw_state_upload.c:456) ==30669== by 0x9529250: brw_upload_pipeline_state (brw_state_upload.c:570) ==30669== by 0x9529250: brw_upload_render_state (brw_state_upload.c:592) ==30669== by 0x951BAAC: brw_try_draw_prims (brw_draw.c:770) ==30669== by 0x951BAAC: brw_draw_prims (brw_draw.c:862) ==30669== by 0x931F0BD: vbo_validated_drawrangeelements (vbo_exec_array.c:925) ==30669== by 0x931F84E: vbo_exec_DrawElements (vbo_exec_array.c:1075) ==30669== by 0xFC744A: glcts::StateInteractionCase::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== by 0xE19E75: es31cts::TestCaseWrapper::iterate(tcu::TestCase*) (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== by 0x27DD23F: tcu::TestSessionExecutor::iterateTestCase(tcu::TestCase*) (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== by 0x27DC794: tcu::TestSessionExecutor::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== by 0x27B1F3E: tcu::App::iterate() (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== by 0xE14313: main (in /home/timothy/data/git_mesa/build-es-cts-new/cts/glcts) ==30669== Address 0xa744580 is 48 bytes inside a block of size 64 free'd ==30669== at 0x4C2ED4A: free (vg_replace_malloc.c:530) ==30669== by 0x944C14C: unsafe_free (ralloc.c:269) ==30669== by 0x92B8EEF: _mesa_clear_shader_program_data (shaderobj.c:334) ==30669== by 0x93A21BD: _mesa_glsl_link_shader (ir_to_mesa.cpp:3088) ==30669== by 0x92B7158: link_program (shaderapi.c:1163) ==30669== by 0x92B7158: link_program_error (shaderapi.c:1241) Here DrawElements ends up calling gen6_upload_push_constants() (which accesses the now freed uniform storage) but I would have presumed constants would still be in the correct state.
Nah, constant upload can happen for a variety of reasons. I think we need to keep prog->UniformStorage around longer, rather than blowing it away in _mesa_clear_shader_program_data().
This bug has been around much longer than Tim's patch, and shouldn't block the release.
I sent a potential patch for this to the mailing list here: https://patchwork.freedesktop.org/patch/185974/ I also made a narrowed-down Piglit test which has a 98% failure rate here: https://patchwork.freedesktop.org/patch/185975/
(In reply to Neil Roberts from comment #7) > I sent a potential patch for this to the mailing list here: > > https://patchwork.freedesktop.org/patch/185974/ > > I also made a narrowed-down Piglit test which has a 98% failure rate here: > > https://patchwork.freedesktop.org/patch/185975/ Thanks for the test. When Ken pointed out the source of the problem I was sure I'd already done work to fix this as all the data used to be attached to gl_linked_shader which gets trashed every time we relink. I moved the data required by the backend to gl_shader_program_data but it seems I forgot to finish of the last step of creating a fresh instance of gl_shader_program_data when we link. I've sent some patches to fix this: https://patchwork.freedesktop.org/series/33378/
Fixed by: commit 6a72eba755fea15a0d97abb913a6315d9d32e274 mesa: rework how we free gl_shader_program_data When I introduced gl_shader_program_data one of the intentions was to fix a bug where a failed linking attempt freed data required by a currently active program. However I seem to have failed to finish hooking up the final steps required to have the data hang around. Here we create a fresh instance of gl_shader_program_data every time we link. gl_program has a reference to gl_shader_program_data so it will be freed once the program is no longer active. Cc: "17.2 17.3" <mesa-stable@lists.freedesktop.org> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Reviewed-by: Neil Roberts <nroberts@igalia.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102177
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.