Summary: | [REGRESSION] [BISECTED] [GLESCTS] race between destruction of types and shader compilation (?) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Andrés Gómez García <agomez> |
Component: | Drivers/DRI/i965 | Assignee: | Tapani Pälli <lemody> |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | agoldmints, agomez, gpoo+bfdo, jasuarez, lemody, mark.a.janes, t_arceri |
Version: | git | Keywords: | bisected, regression |
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | |||
Bug Blocks: | 102590 | ||
Attachments: |
BT for the core running KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat without mesa_glthread move glsl_type dtor back to atexit fix |
Description
Andrés Gómez García
2019-05-29 23:15:21 UTC
Created attachment 144377 [details]
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
If needed, I can provide a docker image with the needed mesa and VK-GL-CTS versions with which to reproduce this SIGSEV. *sigh* will try to reproduce this (In reply to Andrés Gómez García from comment #0) > **Notice that this cannot be reproduce by testing just the problematic test > or the CTS configuration in which the test exists. I've only been able to > reproduce through running with cts-runner for es32** With VK-GL-CTS compiled against Wayland. Andrés, could you try if reverting following commit has any effect on this: https://gitlab.freedesktop.org/mesa/mesa/commit/dca36d5516d0fdaf012b4476975c5d585c2d1a09 (or maybe just disabling gl thread option) (In reply to Tapani Pälli from comment #5) > Andrés, could you try if reverting following commit has any effect on this: > > https://gitlab.freedesktop.org/mesa/mesa/commit/ > dca36d5516d0fdaf012b4476975c5d585c2d1a09 > > (or maybe just disabling gl thread option) mesa_glthread is not activated by default. Still, I've also reverted that commit and tried again and I can confirm that the problem still is reproducible. Created attachment 144677 [details]
BT for the core running dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8.size_pot.clamp_to_edge_repeat without mesa_glthread
(In reply to Andrés Gómez García from comment #0) > The "problematic" tests in which the execution could SIGSEV are: > > dEQP-GLES31.functional.texture.gather.offsets.min_required_offset.2d.rgba8. > size_pot.clamp_to_edge_repeat I can also replicate this problem with Mesa master, and applying Ken's two-configs fix attempt for bug 110357: https://gitlab.freedesktop.org/kwg/mesa/tree/pb565-two-configs > KHR-GLES31.core.gpu_shader5.texture_gather_offsets_color *** Bug 111114 has been marked as a duplicate of this bug. *** In my opinion, this should have blocked the 19.1 release. I'm altering the title to make it clear that this prevents i965 from being compliant with the gles cts. Intel CI is an effective sieve for unfixed bisected regressions, and prevents them from going out in a release. However bugs (like this one) which are not seen in CI have no mechanism that ensures they block the next release. The bug originator is unlikely to remember that the bug is unfixed and add it to the release tracker. The release manager may search in bugzilla for unfixed bugs affecting the release, but will find a lot of noise in the results. Juan, do you have any suggestions for how we can catch these bugs in the release process? I would suggest relying on the "bisected, regression" keywords, but this list seems too long for me: https://bugs.freedesktop.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected%2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DESC&product=Mesa&query_format=advanced We should be marking some of those bugs as WONTFIX etc. (In reply to Mark Janes from comment #10) > > > Juan, do you have any suggestions for how we can catch these bugs in the > release process? I would suggest relying on the "bisected, regression" > keywords, but this list seems too long for me: > > https://bugs.freedesktop.org/buglist. > cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected% > 2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DE > SC&product=Mesa&query_format=advanced > Maybe we can short the list if we search for all the bugs created since the latest branch point (for 19.1, they would be all the tests created after 19.0 branchpoint). The rationale is that all the previous bugs didn't block 19.0, so they shouldn't block 19.1. (In reply to Juan A. Suarez from comment #11) > (In reply to Mark Janes from comment #10) > > https://bugs.freedesktop.org/buglist. > > cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&keywords=bisected% > > 2C%20regression%2C%20&keywords_type=allwords&list_id=676302&order=bug_id%20DE > > SC&product=Mesa&query_format=advanced > > Maybe we can short the list if we search for all the bugs created since the > latest branch point (for 19.1, they would be all the tests created after > 19.0 branchpoint). The rationale is that all the previous bugs didn't block > 19.0, so they shouldn't block 19.1. Makes sense to me. Auditing new bugs before release will improve the quality of our releases. (In reply to Mark Janes from comment #10) > In my opinion, this should have blocked the 19.1 release. I think you are right. > I'm altering the title to make it clear that this prevents i965 from being > compliant with the gles cts. Thanks! > The bug originator is unlikely to remember that the bug is unfixed and add > it to the release tracker. The release manager may search in bugzilla for > unfixed bugs affecting the release, but will find a lot of noise in the > results. Unfortunately, at the time of reporting it was unclear to me whether this was a regression of just a problem happening in a very specific condition. Notice that I was not using master directly but a patched master :( Andrés, have you been able to reproduce this using 'deqp-gles31' directly instead of 'cts-runner'? I'm asking because I'm experiencing weird issues with 'cts-runner', for some reason Mesa loader goes nuts and refuses to load or find driver, this seems to happen when 'cts-runner' is going through EGL module tests. (In reply to Tapani Pälli from comment #14) > Andrés, have you been able to reproduce this using 'deqp-gles31' directly > instead of 'cts-runner'? I'm asking because I'm experiencing weird issues > with 'cts-runner', for some reason Mesa loader goes nuts and refuses to load > or find driver, this seems to happen when 'cts-runner' is going through EGL > module tests. OK, got that solved. Now using 'cts-runner' and trying to reproduce. (In reply to Andrés Gómez García from comment #2) > If needed, I can provide a docker image with the needed mesa and VK-GL-CTS > versions with which to reproduce this SIGSEV. Which versions should be used to reproduce? I cannot seem to reproduce this with: vk-gl-cts: 7f6762b95232c0959f6f96f88fe91391c773d960 mesa: 6fc7384fd44f0b42c6decac4468bba06b28a8186 I've configured vk-gl-cts for wayland and running this under Weston. I've tried running with vblank_mode=0 and without. For the run I've edited txt files so that the run starts near texture.gather tests for quicker reproduce. Tapani: build the gles 3.2.5 branch of the cts: afb0b4b17f0969c7e9353b622a02658877cb0e7e Suppress Clang 7 self-assignment warnings Created attachment 144911 [details] [review] move glsl_type dtor back to atexit Here's something that could be the solution. Having said that, I'm not still clear what the actual issue is :/ But will attempt to dig and debug more. (In reply to Tapani Pälli from comment #16) > (In reply to Andrés Gómez García from comment #2) > > If needed, I can provide a docker image with the needed mesa and VK-GL-CTS > > versions with which to reproduce this SIGSEV. > > Which versions should be used to reproduce? I cannot seem to reproduce this > with: > > vk-gl-cts: 7f6762b95232c0959f6f96f88fe91391c773d960 > mesa: 6fc7384fd44f0b42c6decac4468bba06b28a8186 > > I've configured vk-gl-cts for wayland and running this under Weston. I've > tried running with vblank_mode=0 and without. For the run I've edited txt > files so that the run starts near texture.gather tests for quicker reproduce. Tapani & I are failing to reproduce. What distribution are you running this on? Can get some hints in valgrind now : ==26855== Invalid read of size 8 ==26855== at 0x6192980: prototype_string(glsl_type const*, char const*, exec_list*) (ast_function.cpp:92) ==26855== by 0x619440C: print_function_prototypes(_mesa_glsl_parse_state*, YYLTYPE*, ir_function*) (ast_function.cpp:777) ==26855== by 0x61945CE: no_matching_function_error(char const*, YYLTYPE*, exec_list*, _mesa_glsl_parse_state*) (ast_function.cpp:812) ==26855== by 0x6198F77: ast_function_expression::hir(exec_list*, _mesa_glsl_parse_state*) (ast_function.cpp:2386) ==26855== by 0x619C056: ast_expression::do_hir(exec_list*, _mesa_glsl_parse_state*, bool) (ast_to_hir.cpp:1408) ==26855== by 0x619BE9A: ast_expression::hir_no_rvalue(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:1310) ==26855== by 0x619F18F: ast_expression_statement::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:2236) ==26855== by 0x619F202: ast_compound_statement::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:2252) ==26855== by 0x61A6FFD: ast_function_definition::hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:6223) ==26855== by 0x6199A07: _mesa_ast_to_hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:159) ==26855== by 0x60E0BE9: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2157) ==26855== by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200) ==26855== Address 0x13a78550 is 16 bytes inside a block of size 224 free'd ==26855== at 0x48369AB: free (vg_replace_malloc.c:540) ==26855== by 0x5D00BC8: unsafe_free (ralloc.c:299) ==26855== by 0x5D00B8D: unsafe_free (ralloc.c:292) ==26855== by 0x5D00B8D: unsafe_free (ralloc.c:292) ==26855== by 0x5D00AB2: ralloc_free (ralloc.c:262) ==26855== by 0x60E163B: glsl_symbol_table::operator delete(void*) (glsl_symbol_table.h:43) ==26855== by 0x60E0DAF: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2191) ==26855== by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200) ==26855== by 0x5DC0D55: _mesa_CompileShader (shaderapi.c:1569) ==26855== by 0x12DDA16: glcts::TestCaseBase::buildProgramVA(unsigned int, bool*, unsigned int, ...) (esextcTestCaseBase.cpp:424) ==26855== by 0x12DE1C9: glcts::TestCaseBase::buildProgram(unsigned int, unsigned int, unsigned int, char const* const*, unsigned int, unsigned int, char const* const*, unsigned int, unsigned int, char const* const*, bool*) (esextcTestCaseBase.cpp:581) ==26855== by 0x1393DAE: glcts::GeometryShaderFlatInterpolationTest::initProgram() (esextcGeometryShaderQualifiers.cpp:181) ==26855== Block was alloc'd at ==26855== at 0x483577F: malloc (vg_replace_malloc.c:309) ==26855== by 0x5D00665: ralloc_size (ralloc.c:119) ==26855== by 0x5D00721: rzalloc_size (ralloc.c:151) ==26855== by 0x5FBF655: exec_node::operator new(unsigned long, void*) (list.h:58) ==26855== by 0x60C806C: (anonymous namespace)::builtin_variable_generator::add_const(char const*, int, int) (builtin_variables.cpp:655) ==26855== by 0x60C7968: (anonymous namespace)::builtin_variable_generator::add_const(char const*, int) (builtin_variables.cpp:454) ==26855== by 0x60C8813: (anonymous namespace)::builtin_variable_generator::generate_constants() (builtin_variables.cpp:817) ==26855== by 0x60CB436: _mesa_glsl_initialize_variables(exec_list*, _mesa_glsl_parse_state*) (builtin_variables.cpp:1543) ==26855== by 0x619993E: _mesa_ast_to_hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:131) ==26855== by 0x60E0BE9: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2157) ==26855== by 0x5DBFCE0: _mesa_compile_shader (shaderapi.c:1200) ==26855== by 0x5DC0D55: _mesa_CompileShader (shaderapi.c:1569) ==26855== (In reply to Tapani Pälli from comment #18) > Created attachment 144911 [details] [review] [review] > move glsl_type dtor back to atexit > > Here's something that could be the solution. Having said that, I'm not still > clear what the actual issue is :/ But will attempt to dig and debug more. Tapani: with this patch, I can't reproduce the crash. In my environment, the crash reproduces immediately without your patch. Andrés: can you try the attachment to see if it resolves the bug in your environment? (In reply to Tapani Pälli from comment #18) > Created attachment 144911 [details] [review] [review] > move glsl_type dtor back to atexit > > Here's something that could be the solution. Having said that, I'm not still > clear what the actual issue is :/ But will attempt to dig and debug more. I can't reproduce anymore after applying this patch. I'm using X11 in Fedora. (In reply to Lionel Landwerlin from comment #23) > https://gitlab.freedesktop.org/mesa/piglit/merge_requests/103 Reproduced! \o/ I guess ideally we would solve this without using atexit(), I'll try to see if there would be another way. I have another alternative fix, will do some proper testing and CI Created attachment 144916 [details] [review] fix Here's the fix, needs still some careful testing. (In reply to Tapani Pälli from comment #26) > Created attachment 144916 [details] [review] [review] > fix > > Here's the fix, needs still some careful testing. Tested the fix. Don't get the segfault in cts with patch. This is now resolved in staging/19.1 branch (meaning next 19.1 release) with following commit. There will be a different fix for master branch. ------- 8< ------ Author: Tapani Pälli <tapani.palli@intel.com> Date: Tue Aug 6 13:10:10 2019 +0300 mesa: add glsl_type ref to one_time_init and decref to atexit This fixes problems spotted within vk-gl-cts. Problem is that the builtin functions refer to types and we should not release types before builtins are released. Fixes: 624789e3708c ("compiler/glsl: handle case where we have multiple users for types") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110796 Signed-off-by: Tapani Pälli <tapani.palli@intel.com> Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (In reply to Mark Janes from comment #21) > (In reply to Tapani Pälli from comment #18) > > Created attachment 144911 [details] [review] [review] [review] > > move glsl_type dtor back to atexit > > > > Here's something that could be the solution. Having said that, I'm not still > > clear what the actual issue is :/ But will attempt to dig and debug more. > > Tapani: with this patch, I can't reproduce the crash. In my environment, > the crash reproduces immediately without your patch. > > Andrés: can you try the attachment to see if it resolves the bug in your > environment? Tested the patch with my original environment and I can confirm it is fixing the SIGSEV. I also tested against the newly added piglit test at https://gitlab.freedesktop.org/mesa/piglit/merge_requests/103 and I can also confirm that it solves the issue. Fixed in master by: --- 8< --- commit e4da8b9c331cc3ae1b86b3a7860231e202463db0 Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Date: Wed Jul 31 12:12:10 2019 +0300 mesa/compiler: rework tear down of builtin/types The issue we're running into when running CTS is that glsl types are deleted while builtins depending on them are not. This happens because on one hand we have glsl types ref counted, but builtins are not. Instead builtins are destroyed when unloading libGL or explicitly calling glReleaseShaderCompiler(). This change removes almost entirely any dealing with glsl types ref/unref by letting the builtins deal with it instead. In turn we introduce a builtin ref count mechanism. Each GL context takes a reference on the builtins when compiling a shader for the first time. It releases the reference when the context is destroyed. It can also explicitly release those when glReleaseShaderCompiler() is called. Finally we also take a reference on the glsl types when loading libGL to avoid recreating glsl types too often. v2: Ensure we take a reference if we don't have one in link step (Lionel) Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110796 Reviewed-by: Eric Anholt <eric@anholt.net> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.