Created attachment 122975 [details] apitrace There are some uninitialized variables in current master that are causing some of the VTK unit tests to segfault randomly, or just not produce any output. I've included as much information as I can think of below. Feel free to ping me if more info is needed. -------------------------------------------- Configure options: ./autogen.sh \ --enable-debug \ --prefix="$prefix" \ --disable-dri \ --disable-egl \ --disable-gles1 \ --disable-gles2 \ --disable-shared-glapi \ --enable-xlib-glx \ --enable-gallium-osmesa \ --with-gallium-drivers=swrast \ --enable-gallium-llvm=yes \ LLVM_CONFIG=/usr/bin/llvm-config-3.6 \ --enable-llvm-shared-libs \ --with-gl-lib-name=MesaGL \ --with-osmesa-lib-name=MesaOSMesa --------------------------------------------- Install with: make install-data cd src/gallium make install-exec (Regular 'make install' does this in the install dir for some reason: $ ls mesa-master-install/lib/libMesaGL.so* -lh mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0 mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0 mesa-master-install/lib/libMesaGL.so.1.5.0 mesa-master-install/lib/libMesaGL.so.1.6.0 which confuses my linker/loader ;) Another bug?) --------------------------------------------- memcheck the attached apitrace with (obviously change paths as needed): MESA_GL_VERSION_OVERRIDE=4.5 \ LD_LIBRARY_PATH=/ssd/src/llvm-3.8.0.install/lib \ LD_PRELOAD=/ssd/src/mesa-master-install/lib/libMesaGL.so \ valgrind --tool=memcheck \ glretrace vtkRenderingOpenGL2CxxTests.4075.trim.trace --------------------------------------------- Sample valgrind stacks: ==32054== Conditional jump or move depends on uninitialised value(s) ==32054== at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58) ==32054== by 0x5444AFE: llvmpipe_set_framebuffer_state (lp_state_surface.c:54) ==32054== by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694) ==32054== by 0x5356819: util_blitter_blit (u_blitter.c:1813) ==32054== by 0x544602C: lp_blit (lp_surface.c:117) ==32054== by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672) ==32054== by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459) ==32054== by 0x50B330D: copyteximage (teximage.c:3644) ==32054== by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680) ==32054== by 0x4D340E: ??? (in /usr/bin/glretrace) ==32054== by 0x40CCCC: ??? (in /usr/bin/glretrace) ==32054== by 0x40D2A7: ??? (in /usr/bin/glretrace) ==32054== Conditional jump or move depends on uninitialised value(s) ==32054== at 0x5409DEE: lp_build_blend_factor_unswizzled (lp_bld_blend_aos.c:98) ==32054== by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262) ==32054== by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352) ==32054== by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094) ==32054== by 0x543D32D: generate_fragment (lp_state_fs.c:2434) ==32054== by 0x543DEAC: generate_variant (lp_state_fs.c:2619) ==32054== by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171) ==32054== by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209) ==32054== by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70) ==32054== by 0x52E28C4: cso_draw_vbo (cso_context.c:1629) ==32054== by 0x5174712: st_draw_vbo (st_draw.c:251) ==32054== by 0x511BE24: vbo_validated_drawrangeelements (vbo_exec_array.c:844) ==32054== Conditional jump or move depends on uninitialised value(s) ==32054== at 0x404002D: ??? ==32054== by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457) ==32054== by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346) ==32054== by 0x5415A81: do_rasterize_bin (lp_rast.c:609) ==32054== by 0x5415AEF: rasterize_bin (lp_rast.c:628) ==32054== by 0x5415BFE: rasterize_scene (lp_rast.c:688) ==32054== by 0x5415EE3: thread_function (lp_rast.c:828) ==32054== by 0x5413C4A: impl_thrd_routine (threads_posix.h:87) ==32054== by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so) ==32054== by 0x720CCBC: clone (in /usr/lib/libc-2.23.so)
(In reply to David Lonie from comment #0) > Created attachment 122975 [details] > apitrace > > There are some uninitialized variables in current master that are causing > some of the VTK unit tests to segfault randomly, or just not produce any > output. I've included as much information as I can think of below. > > Feel free to ping me if more info is needed. > > -------------------------------------------- > Configure options: > > ./autogen.sh \ > --enable-debug \ > --prefix="$prefix" \ > --disable-dri \ > --disable-egl \ > --disable-gles1 \ > --disable-gles2 \ > --disable-shared-glapi \ > --enable-xlib-glx \ > --enable-gallium-osmesa \ > --with-gallium-drivers=swrast \ > --enable-gallium-llvm=yes \ > LLVM_CONFIG=/usr/bin/llvm-config-3.6 \ > --enable-llvm-shared-libs \ > --with-gl-lib-name=MesaGL \ > --with-osmesa-lib-name=MesaOSMesa > > --------------------------------------------- > Install with: > > make install-data > cd src/gallium > make install-exec > > (Regular 'make install' does this in the install dir for some reason: > > $ ls mesa-master-install/lib/libMesaGL.so* -lh > > mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0 > > mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0 > mesa-master-install/lib/libMesaGL.so.1.5.0 > mesa-master-install/lib/libMesaGL.so.1.6.0 > > which confuses my linker/loader ;) Another bug?) > > --------------------------------------------- > memcheck the attached apitrace with (obviously change paths as needed): > > MESA_GL_VERSION_OVERRIDE=4.5 \ > LD_LIBRARY_PATH=/ssd/src/llvm-3.8.0.install/lib \ > LD_PRELOAD=/ssd/src/mesa-master-install/lib/libMesaGL.so \ > valgrind --tool=memcheck \ > glretrace vtkRenderingOpenGL2CxxTests.4075.trim.trace > > --------------------------------------------- > Sample valgrind stacks: > ==32054== Conditional jump or move depends on uninitialised value(s) > ==32054== at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58) > ==32054== by 0x5444AFE: llvmpipe_set_framebuffer_state > (lp_state_surface.c:54) > ==32054== by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694) > ==32054== by 0x5356819: util_blitter_blit (u_blitter.c:1813) > ==32054== by 0x544602C: lp_blit (lp_surface.c:117) > ==32054== by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672) > ==32054== by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459) > ==32054== by 0x50B330D: copyteximage (teximage.c:3644) > ==32054== by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680) > ==32054== by 0x4D340E: ??? (in /usr/bin/glretrace) > ==32054== by 0x40CCCC: ??? (in /usr/bin/glretrace) > ==32054== by 0x40D2A7: ??? (in /usr/bin/glretrace) This one looks easy to fix. Though, I wasn't able to reproduce the valgrind warning here with piglit's copytexsubimage test which definitely hits the same code path. > ==32054== Conditional jump or move depends on uninitialised value(s) > ==32054== at 0x5409DEE: lp_build_blend_factor_unswizzled > (lp_bld_blend_aos.c:98) > ==32054== by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262) > ==32054== by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352) > ==32054== by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094) > ==32054== by 0x543D32D: generate_fragment (lp_state_fs.c:2434) > ==32054== by 0x543DEAC: generate_variant (lp_state_fs.c:2619) > ==32054== by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171) > ==32054== by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209) > ==32054== by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70) > ==32054== by 0x52E28C4: cso_draw_vbo (cso_context.c:1629) > ==32054== by 0x5174712: st_draw_vbo (st_draw.c:251) > ==32054== by 0x511BE24: vbo_validated_drawrangeelements > (vbo_exec_array.c:844) I don't see what's wrong here. At lp_bld_blend_aos.c:98 we're examining fields of the bld object. But the whole bld object is initialized to zeros at line 321. > > ==32054== Conditional jump or move depends on uninitialised value(s) > ==32054== at 0x404002D: ??? > ==32054== by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457) > ==32054== by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346) > ==32054== by 0x5415A81: do_rasterize_bin (lp_rast.c:609) > ==32054== by 0x5415AEF: rasterize_bin (lp_rast.c:628) > ==32054== by 0x5415BFE: rasterize_scene (lp_rast.c:688) > ==32054== by 0x5415EE3: thread_function (lp_rast.c:828) > ==32054== by 0x5413C4A: impl_thrd_routine (threads_posix.h:87) > ==32054== by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so) > ==32054== by 0x720CCBC: clone (in /usr/lib/libc-2.23.so) Not sure about this one. I suspect our jitted code does some vector ops on some unused elements and that triggers the valgrind warning, but that should be harmless. Roland??
(In reply to Brian Paul from comment #1) > (In reply to David Lonie from comment #0) > > ==32054== Conditional jump or move depends on uninitialised value(s) > > ==32054== at 0x5409DEE: lp_build_blend_factor_unswizzled > > (lp_bld_blend_aos.c:98) > > ==32054== by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262) > > ==32054== by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352) > > ==32054== by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094) > > ==32054== by 0x543D32D: generate_fragment (lp_state_fs.c:2434) > > ==32054== by 0x543DEAC: generate_variant (lp_state_fs.c:2619) > > ==32054== by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171) > > ==32054== by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209) > > ==32054== by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70) > > ==32054== by 0x52E28C4: cso_draw_vbo (cso_context.c:1629) > > ==32054== by 0x5174712: st_draw_vbo (st_draw.c:251) > > ==32054== by 0x511BE24: vbo_validated_drawrangeelements > > (vbo_exec_array.c:844) > > I don't see what's wrong here. At lp_bld_blend_aos.c:98 we're examining > fields of the bld object. But the whole bld object is initialized to zeros > at line 321. The problem is we're later assigning it the src1_alpha value coming from the fs. And this one is only initialized if we have dual source blend. I'll see if I can fix this easily, as the warning is annoying (I've seen it before), however it is definitely 100% harmless and will not cause any issues whatsoever (the conditional assignment may rely on uninitialized values, however the assigned value will be unused anyway). > > > > > ==32054== Conditional jump or move depends on uninitialised value(s) > > ==32054== at 0x404002D: ??? > > ==32054== by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457) > > ==32054== by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346) > > ==32054== by 0x5415A81: do_rasterize_bin (lp_rast.c:609) > > ==32054== by 0x5415AEF: rasterize_bin (lp_rast.c:628) > > ==32054== by 0x5415BFE: rasterize_scene (lp_rast.c:688) > > ==32054== by 0x5415EE3: thread_function (lp_rast.c:828) > > ==32054== by 0x5413C4A: impl_thrd_routine (threads_posix.h:87) > > ==32054== by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so) > > ==32054== by 0x720CCBC: clone (in /usr/lib/libc-2.23.so) > > Not sure about this one. I suspect our jitted code does some vector ops on > some unused elements and that triggers the valgrind warning, but that should > be harmless. Roland?? This one is actually really annoying, as it turns up everywhere and thus makes valgrind output spammy. I'm quite convinced it's completely harmless, but so far I've been unable to eliminate it... But in any case, I HIGHLY doubt these two are the reason for any random segfaults, I certainly don't see any evidence here. So, a backtrace of the actual crash would probably be more useful.
(In reply to David Lonie from comment #0) > (Regular 'make install' does this in the install dir for some reason: > > $ ls mesa-master-install/lib/libMesaGL.so* -lh > > mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0 > > mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0 > mesa-master-install/lib/libMesaGL.so.1.5.0 > mesa-master-install/lib/libMesaGL.so.1.6.0 > > which confuses my linker/loader ;) Another bug?) > Indeed it is - your colleague (?) Chuck Atkins is working on that one. See bug#94086.
(In reply to Roland Scheidegger from comment #2) > > > ==32054== Conditional jump or move depends on uninitialised value(s) > > > ==32054== at 0x404002D: ??? > > > ==32054== by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457) > > > ==32054== by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346) > > > ==32054== by 0x5415A81: do_rasterize_bin (lp_rast.c:609) > > > ==32054== by 0x5415AEF: rasterize_bin (lp_rast.c:628) > > > ==32054== by 0x5415BFE: rasterize_scene (lp_rast.c:688) > > > ==32054== by 0x5415EE3: thread_function (lp_rast.c:828) > > > ==32054== by 0x5413C4A: impl_thrd_routine (threads_posix.h:87) > > > ==32054== by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so) > > > ==32054== by 0x720CCBC: clone (in /usr/lib/libc-2.23.so) > > > > Not sure about this one. I suspect our jitted code does some vector ops on > > some unused elements and that triggers the valgrind warning, but that should > > be harmless. Roland?? > > This one is actually really annoying, as it turns up everywhere and thus > makes valgrind output spammy. I'm quite convinced it's completely harmless, > but so far I've been unable to eliminate it... To elaborate on that, the problem is that some of those complaints from valgrind may be valid, but others are not. And thus the output isn't all that useful. It is possible it's due to a legit bug - in which case rendering might be wrong however it still should not cause segfaults (and I didn't see that with this replay, just bogus rendering). FWIW the trace has some issues, namely it requests a 4.5 context thus needs overrides to run. Not sure if this actually causes problems. I actually tried to give the replay a look, but the problem is that so far I've always failed to get apitrace + valgrind + gdb to work. The problem is I can't get into gdb on errors on forked processes. valgrind --trace-children=yes works fine, but if you use that with --vgdb=yes --vgdb-error=0 what happens is this works on first (useless) process just fine. However, valgrind will ask to attach a second gdb on the fork, which immediately claims remote connection closed after the "target remote..." command. It will actually ask again after that happens, but if you retry it gdb just complains it can't access memory at some addresses and that's it - on errors valgrind just waits on that gdb connection which doesn't exist... If someone knows if and how this is somehow supposed to work that would be helpful... I think this actually used to work with the --db-enable option but this one is gone for good now.
(In reply to Roland Scheidegger from comment #4) > I actually tried to give the replay a look, but the problem is that so far > I've always failed to get apitrace + valgrind + gdb to work. The problem is > I can't get into gdb on errors on forked processes. Forget this it actually works. I'll see if I can find what caused this specific complaint, but I wouldn't count on it this being the real problem.
(In reply to Roland Scheidegger from comment #5) > (In reply to Roland Scheidegger from comment #4) > > I actually tried to give the replay a look, but the problem is that so far > > I've always failed to get apitrace + valgrind + gdb to work. The problem is > > I can't get into gdb on errors on forked processes. > Forget this it actually works. I'll see if I can find what caused this > specific complaint, but I wouldn't count on it this being the real problem. So, I've tracked this down a bit - the jne instruction where valgrind complains is actually the result of a ptest instruction. This is due to lp_build_mask_check() which tests if not all pixels have been killed (if they have, it would skip execution of the rest of the shader). This "pixel alive" mask is always initialized fully, so initially not undefined (the plane evaluation generated this essentially figuring out which pixels in a 2x2 pixel quad belong to the current triangle). BUT, the shader does something along the lines of: IF whatever temp0.x = something ENDIF KILL_IF temp0.x Interestingly, we actually do zero-intialize all temp regs, thus things still have to be defined, even if that temp0.x assignment didn't assign all channels according to the exec mask. The kill_if will then combine the pixel alive mask with the mask generated from temp0.x < 0 comparison, then do the lp_build_mask_check() using this mask which triggers the warning. Ultimately, I do not really know why valgrind thinks it's undefined. However, to my knowledge valgrind can track definedness of values at a bit level (which is quite an amazing feat indeed), but NOT for sse regs. There's some warnings it might be less accurate there. (FWIW --track-origins=yes simply points to lp_rast_arg_triangle_contained (lp_rast.h:220)) Theoretically, of course it is also possible the undefinedness directly comes from a buggy application itself (think something simple like glUniform call using an undefined value in the first place). But I've seen this warning so many times and everything still worked that I do not believe valgrind complaints which are directed toward mask checks point to any real problem (and I have doubts they are even valid). So, I'm going to mark this bug as fixed. Two minor issues have been addressed (well actually the fb one could have real consequences resulting in unnecessary state updates), and I'm not seeing any random segfaults in any case. Even if valgrind is right about the uninitialized values in the jit code that won't cause crashes (valgrind would say invalid read/write for anything which could crash). Feel free to open a new bug if you see crashes (preferably with a backtrace) or misrenderings.
> Comment # 1 on bug 94955 from Brian Paul > (In reply to David Lonie from comment #0) > > ==32054== Conditional jump or move depends on uninitialised value(s) > > ==32054== at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58) > > ==32054== by 0x5444AFE: llvmpipe_set_framebuffer_state > > (lp_state_surface.c:54) > > ==32054== by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694) > > ==32054== by 0x5356819: util_blitter_blit (u_blitter.c:1813) > > ==32054== by 0x544602C: lp_blit (lp_surface.c:117) > > ==32054== by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672) > > ==32054== by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459) > > ==32054== by 0x50B330D: copyteximage (teximage.c:3644) > > ==32054== by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680) > > ==32054== by 0x4D340E: ??? (in /usr/bin/glretrace) > > ==32054== by 0x40CCCC: ??? (in /usr/bin/glretrace) > > ==32054== by 0x40D2A7: ??? (in /usr/bin/glretrace) > > This one looks easy to fix. Though, I wasn't able to reproduce the valgrind > warning here with piglit's copytexsubimage test which definitely hits the same > code path. I've poked at it, and this patch seems to do the trick for me: --- a/src/gallium/auxiliary/util/u_blitter.c +++ b/src/gallium/auxiliary/util/u_blitter.c @@ -1573,6 +1573,8 @@ void util_blitter_blit_generic(struct blitter_context *blitter, fb_state.nr_cbufs = blit_depth || blit_stencil ? 0 : 1; fb_state.cbufs[0] = NULL; fb_state.zsbuf = NULL; + fb_state.samples = 0; + fb_state.layers = 0; if (blit_depth || blit_stencil) { pipe->bind_blend_state(pipe, ctx->blend[0][0]); > Comment # 2 on bug 94955 from Roland Scheidegger > (In reply to Brian Paul from comment #1) > > (In reply to David Lonie from comment #0) > But in any case, I HIGHLY doubt these two are the reason for any random > segfaults, I certainly don't see any evidence here. So, a backtrace of the > actual crash would probably be more useful. Glad the other two seem harmless. If these aren't likely to cause a segfault I'll keep poking around. The backtrace is difficult to obtain because by the time it crashes the stack is corrupt and the resulting backtrace is meaningless. Since the issue is a stack corruption, it makes it difficult to step through, since the crashes happen somewhat randomly, and when it does all debugger state is lost since the stack is nonsense. Hence why I thought the memory errors would be at fault, but it looks like this is going to be trickier and more subtle than it seemed. I'll keep looking and update when I find something else. > Comment # 3 on bug 94955 from Emil Velikov > (In reply to David Lonie from comment #0) > > which confuses my linker/loader ;) Another bug?) > > > Indeed it is - your colleague (?) Chuck Atkins is working on that one. See > bug#94086. So he is! Glad it's a known issue, that one took some work to track down ;) > Comment # 4 on bug 94955 from Roland Scheidegger > FWIW the trace has some issues, namely it requests a 4.5 context thus needs > overrides to run. Not sure if this actually causes problems. The 4.5 override was needed for my apitrace replay context. The actual segfaults are happening on a machine using a different (I believe 3.2?) context. Is there information around that details how to get a better apitrace for you folks? I have another segfaulting test I could capture for you. > Comment # 6 on bug 94955 from Roland Scheidegger > So, I'm going to mark this bug as fixed. Two minor issues have been addressed > (well actually the fb one could have real consequences resulting in unnecessary > state updates), and I'm not seeing any random segfaults in any case. Even if > valgrind is right about the uninitialized values in the jit code that won't > cause crashes (valgrind would say invalid read/write for anything which could > crash). Feel free to open a new bug if you see crashes (preferably with a > backtrace) or misrenderings. Sounds good to me. Thanks so much for the fast replies and thorough checking of the valgrind reports, even if they turned out to be bogus!
(In reply to David Lonie from comment #7) > > Comment # 4 on bug 94955 from Roland Scheidegger > > FWIW the trace has some issues, namely it requests a 4.5 context thus needs > > overrides to run. Not sure if this actually causes problems. > > The 4.5 override was needed for my apitrace replay context. The actual > segfaults are happening on a machine using a different (I believe 3.2?) > context. > > Is there information around that details how to get a better apitrace for > you folks? I have another segfaulting test I could capture for you. > The problem is, if the trace requires version override, it is difficult to tell if it actually isn't using any functions which are actually known to not really work.
Reopening -- I may have found it. This other failing test looks better for tracking this down. It segfaults reproducibly with this stack trace from valgrind: ==19776== Address 0x207abe3c is not stack'd, malloc'd or (recently) free'd ==19776== ==19776== ==19776== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==19776== Access not within mapped region at address 0x207ABE3C ==19776== at 0x4039042: ??? ==19776== by 0x113B377A: lp_rast_shade_quads_mask (lp_rast.c:457) ==19776== by 0x113B8E5A: do_block_4_2 (lp_rast_tri_tmp.h:67) ==19776== by 0x113B9194: do_block_16_2 (lp_rast_tri_tmp.h:152) ==19776== by 0x113B96DC: lp_rast_triangle_2 (lp_rast_tri_tmp.h:305) ==19776== by 0x113B3AC5: do_rasterize_bin (lp_rast.c:609) ==19776== by 0x113B3B33: rasterize_bin (lp_rast.c:628) ==19776== by 0x113B3C42: rasterize_scene (lp_rast.c:688) ==19776== by 0x113B3F27: thread_function (lp_rast.c:828) ==19776== by 0x113B1C8E: impl_thrd_routine (threads_posix.h:87) ==19776== by 0x11C7A423: start_thread (in /usr/lib/libpthread-2.23.so) ==19776== by 0x4F1DCBC: clone (in /usr/lib/libc-2.23.so) This looks similar to the third trace in my original bug, but it's an access violation instead of an uninitialized value, and some of the frames are different. The core dump valgrind generates is corrupted, unfortunately: (gdb) bt #0 0x0000000004039042 in ?? () #1 0x3ebb67ae3ebb67ae in ?? () #2 0x3ebb67ae3ebb67ae in ?? () #3 0x3ebb67ae3ebb67ae in ?? () #4 0x3ebb67ae3ebb67ae in ?? () #5 0x0000000000000000 in ?? () Running in gdb gives a similar backtrace. Tried getting an apitrace, but the segfault is preventing that from producing anything meaningful, too. Hopefully that stack will be enough...I wish I could get you guys a useful apitrace or core dump to inspect, but this crash is doing a good job of covering its tracks! Best I can do is provide instructions for running the VTK test that reproduces it: 1) git clone https://gitlab.kitware.com/vtk/vtk.git 2) mkdir vtk-build 3) cd vtk-build 4) cmake ../vtk \ -DOPENGL_INCLUDE_DIR=/path/to/mesa/install/prefix/include \ -DOPENGL_gl_LIBRARY=/path/to/mesa/install/prefix/lib/libMesaGL.so \ -DOPENGL_glu_LIBRARY="" 5) make To run the test, either: ctest -R TestTextureRGBADepthPeeling or (to get around the ctest launcher for gdb/valgrind): bin/vtkRenderingCoreCxxTests "TestTextureRGBADepthPeeling" "-D" "ExternalData/Testing" "-T" "Testing/Temporary" "-V" "ExternalData/Rendering/Core/Testing/Data/Baseline/TestTextureRGBADepthPeeling.png
Does the gallium llvm sampler fully support float textures? Both llvmpipe and OpenSWR generate the same sampler and fail identically. I'll attach the sampler ir, if it's useful. Any hints on debugging sampler jit are appreciated. :-)
Created attachment 123134 [details] tex_sample_func Here's the ir for the jit'd sampler. I forced "use_tex_func" so it didn't inline with the FS, although it fails the same either way.
(In reply to Bruce Cherniak from comment #10) > Does the gallium llvm sampler fully support float textures? Both llvmpipe > and OpenSWR generate the same sampler and fail identically. Yes, that's easily supported (it's actually one of the most easy formats code-wise). The sampler in question doesn't even use any filtering, so compared to some other generated sampler funcs it's dead simple. Forcing tex func is actually a really good idea to separate things out (even more so when looking at the assembly), albeit with a float type it should not have been necessary (the logic there what's complex or not is quite flawed actually...) Though it doesn't look like float format to me. More like D24X8. > > I'll attach the sampler ir, if it's useful. Any hints on debugging sampler > jit are appreciated. :-) Any evidence it actually crashed there? Kind of difficult to debug if you don't get a valid address... My guess would be it's sampling the system (so not fbo) depth buffer, and the buffer disappeared in the meantime (if it really is crashing there)... In that case, it might be similar to bug 94522.
Actually it is crashing in sampling, but it's a different sampler. Looks like a TXF, there _might_ be a problem with mip level calculation as it hits a seldomly used path...
Created attachment 123136 [details] gallivm_debug shaders (inline sampler) I got distracted trying to get tex func to dump disassembly. :-$ Seems that in gallivm_compile_module LLVMGetPointerToGlobal returns func_code = 0 (lp_bld_init:621) and causes disassembly to crash. But, that's another problem. So, I switched back to an inline sampler (use_tex_func = 0). All of the shader dump (tgsi,ir,asm) are attached. The faulting address is within the sampler portion of the fragment shader. (gdb) bt 2 #0 0x00007fffe9acec71 in FS () #1 0x00007fffe9c74bdb in BackendSingleSample<0u, 1u, 1u, 0u, 0u> (pDC=0x710780, workerId=0, x=56, y=56, work=..., renderBuffers=...) at ../../../../../src/gallium/drivers/swr/rasterizer/core/backend.cpp:821 (disassembly around $pc) 0x00007fffe9acec4a <+3146>: or %eax,%esp 0x00007fffe9acec4c <+3148>: sarl $0xc1,-0x25(%rcx) 0x00007fffe9acec50 <+3152>: movabs $0x7fffe9acd080,%rcx 0x00007fffe9acec5a <+3162>: vmovdqa (%rcx),%xmm12 0x00007fffe9acec5e <+3166>: vpshufb %xmm12,%xmm0,%xmm3 0x00007fffe9acec63 <+3171>: movslq %ebx,%r10 0x00007fffe9acec66 <+3174>: sar $0x20,%rbx 0x00007fffe9acec6a <+3178>: movslq %edi,%rcx 0x00007fffe9acec6d <+3181>: sar $0x20,%rdi => 0x00007fffe9acec71 <+3185>: vmovd (%r10,%rax,1),%xmm0 0x00007fffe9acec77 <+3191>: vpinsrd $0x1,(%rbx,%rax,1),%xmm0,%xmm0 0x00007fffe9acec7e <+3198>: vpinsrd $0x2,(%rcx,%rax,1),%xmm0,%xmm0 0x00007fffe9acec85 <+3205>: vpinsrd $0x3,(%rdi,%rax,1),%xmm0,%xmm2 0x00007fffe9acec8c <+3212>: movslq %edx,%rcx 0x00007fffe9acec8f <+3215>: sar $0x20,%rdx 0x00007fffe9acec93 <+3219>: movslq %esi,%rdi 0x00007fffe9acec96 <+3222>: sar $0x20,%rsi 0x00007fffe9acec9a <+3226>: vmovd (%rcx,%rax,1),%xmm0 0x00007fffe9acec9f <+3231>: vpinsrd $0x1,(%rdx,%rax,1),%xmm0,%xmm0 0x00007fffe9aceca6 <+3238>: vpinsrd $0x2,(%rdi,%rax,1),%xmm0,%xmm0 0x00007fffe9acecad <+3245>: vpinsrd $0x3,(%rsi,%rax,1),%xmm0,%xmm1 0x00007fffe9acecb4 <+3252>: vpmovzxbw %xmm2,%xmm6 0x00007fffe9acecb9 <+3257>: vpmovzxbw %xmm1,%xmm14 0x00007fffe9acecbe <+3262>: vpsubw %xmm6,%xmm14,%xmm14 0x00007fffe9acecc2 <+3266>: vpmovzxbw %xmm3,%xmm0 0x00007fffe9acecc7 <+3271>: vpmullw %xmm0,%xmm14,%xmm7 0x00007fffe9aceccb <+3275>: vpsrlw $0x8,%xmm7,%xmm7 0x00007fffe9acecd0 <+3280>: vpaddw %xmm6,%xmm7,%xmm6 0x00007fffe9acecd4 <+3284>: vxorps %xmm5,%xmm5,%xmm5 0x00007fffe9acecd8 <+3288>: vpunpckhbw %xmm5,%xmm3,%xmm7 0x00007fffe9acecdc <+3292>: vpunpckhbw %xmm5,%xmm2,%xmm2 0x00007fffe9acece0 <+3296>: vpunpckhbw %xmm5,%xmm1,%xmm1 0x00007fffe9acece4 <+3300>: vxorps %xmm10,%xmm10,%xmm10 (gdb) p/x $r10 $2 = 0x20023fc (gdb) p/x $rax $3 = 0x364f6c0 (gdb) p/x $xmm0 $4 = {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0}, v8_int16 = {0x80, 0x0, 0x80, 0x0, 0x80, 0x0, 0x80, 0x0}, v4_int32 = {0x80, 0x80, 0x80, 0x80}, v2_int64 = {0x8000000080, 0x8000000080}, uint128 = 0x0000008000000 0800000008000000080} (gdb) i r rax 0x364f6c0 56948416 rbx 0x20023fc 33563644 rcx 0x20023fc 33563644 rdx 0x200240002002400 144154770528019456 rsi 0x200240002002400 144154770528019456 rdi 0x20023fc 33563644 rbp 0x7fffffffa010 0x7fffffffa010 rsp 0x7fffffff9d20 0x7fffffff9d20 r8 0x1fffffc01fffffc 144115170929541116 r9 0x200000002000000 144115188109410304 r10 0x20023fc 33563644 r11 0x1fffffc01fffffc 144115170929541116 r12 0x0 0 r13 0x0 0 r14 0x200000002000000 144115188109410304 r15 0x7fffffffa2a0 140737488331424 rip 0x7fffe9acec71 0x7fffe9acec71 <FS+3185> eflags 0x10206 [ PF IF RF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0
Created attachment 123137 [details] partial gallium trace Attaching the gallium trace (at least up until the crash). I'm still looking through it, but the 2 resources being sampled are: 3948 pipe_context::create_sampler_view(pipe = 0x01073190, resource = 0x03fe46b0, templ = {format = PIPE_FORMAT_R8G8B8A8_UNORM, u = {tex = {first_layer = 0, last_layer = 0, first_level = 0, last_level = 0}}, swizzle_r = 0, swizzle_g = 1, swizzle_b = 2, swizzle_a = 3}) = 0x03f72690 // time 9 3949 pipe_context::create_sampler_view(pipe = 0x01073190, resource = 0x02be7930, templ = {format = PIPE_FORMAT_R32G32_FLOAT, u = {tex = {first_layer = 0, last_layer = 0, first_level = 0, last_level = 0}}, swizzle_r = 0, swizzle_g = 1, swizzle_b = 4, swizzle_a = 3}) = 0x03fcb740 // time 7
BTW, for my edification, what indicates it might be a bug in the mip level calculation?
(In reply to Bruce Cherniak from comment #11) > Created attachment 123134 [details] > tex_sample_func > > Here's the ir for the jit'd sampler. I forced "use_tex_func" so it didn't > inline with the FS, although it fails the same either way. Actually for me the sampler which crashes is a bog-standard rgba8 one, with linear filtering, repeat wrap, npot. I think I found the problem, the coords are all nans (might just be the shader doing this), and the wrap code isn't quite robust against that. Nasty.
(In reply to Bruce Cherniak from comment #16) > BTW, for my edification, what indicates it might be a bug in the mip level > calculation? That was just a guess as it hit the per-pixel mip computations, which was odd - in fact the texture was explicitly marked as having one level only, so I figured maybe there's something wrong with that logic there, as there was plenty of unnecessary instructions (8 shifts, 8 max, some more - despite that we actually know the result of all this has to be zero anyway). But it's kinda expected actually so was correct, albeit some day might want to improve that (and was in different sampling code, I totally missed that...).
Thanks Roland! You patch fixes the segv I get when running both llvmpipe and OpenSWR; and the test itself now passes on llvmpipe.
(In reply to Bruce Cherniak from comment #19) > Thanks Roland! You patch fixes the segv I get when running both llvmpipe > and OpenSWR; and the test itself now passes on llvmpipe. I wonder what the test expects, I haven't seen anything but NaN coords with that sampler ;-). That can't have possibly given really useful results, unless not crashing is good enough to pass :-).
Thanks for the fix! > Comment # 20 on bug 94955 from Roland Scheidegger > (In reply to Bruce Cherniak from comment #19) > > Thanks Roland! You patch fixes the segv I get when running both llvmpipe > > and OpenSWR; and the test itself now passes on llvmpipe. The segfault is indeed gone, so that's a definite improvement! > I wonder what the test expects, I haven't seen anything but NaN coords with > that sampler ;-). That can't have possibly given really useful results, unless > not crashing is good enough to pass :-). The test is not passing, but it is rendering a portion of the scene (the part that doesn't use floating point textures). What's interesting is that mesa is reporting ARB_texture_float, even though I've not configured with --enable-texture-float, so that's likely part of the problem. I also see error messages popping out from mesa at runtime: Illegal sampler view creation without bind flag Recompiling with --enable-texture-float fixed neither the rendering or the error messages, so it appears something else is still going on. Should this be a new bug? Is there more information I can provide? This can be reproduced in VTK as described in comment #9.
(In reply to David Lonie from comment #21) > > I wonder what the test expects, I haven't seen anything but NaN coords with > > that sampler ;-). That can't have possibly given really useful results, unless > > not crashing is good enough to pass :-). > > The test is not passing, but it is rendering a portion of the scene (the > part that doesn't use floating point textures). Ok. I guess the all nans there as coords have something to do with it... > > Illegal sampler view creation without bind flag This is "typically" harmless for now at least (especially on sw renderers), but an abuse of bind flags (though I don't know about swr, if you make use of the flags to track potential changes of the contents of resources). If a resource is created, it must have sampler view bind flag, otherwise it's illegal to later create a sampler view. The problem is, this requirement doesn't fit well with GL. Sometimes for things as blitting such resources also util code may create sampler views out of them. If you're interested it should be pretty easy to figure out where the result was created, why it doesn't have the sampler view bind flag and why later a sampler view is created regardless, but it most likely isn't the source of the issue. > Should this be a new bug? Is there more information I can provide? This can > be reproduced in VTK as described in comment #9. Yes this should be a new bug (even the ones fixed actually have nothing to do with the bug title already, so let's not make this worse). You probably need to track down what's going wrong. All I can say from when I was looking at the NaN coords, that shader seemed to source the coords directly from an input attribute, and the vs looked fairly simple, so I don't know what happened really. However, the test seemed to possibly use some kind of multi-pass operation, hence it's possible previous shaders didn't output the right result already.
Opened a new bug: https://bugs.freedesktop.org/show_bug.cgi?id=95266 Feel free to close this one.
The sampler crashing due to NaN coords has been fixed by bd07e20d208268382a34dca23ff71a8192bb1525 and the test failing (presumably the reason is the coords shouldn't be NaNs) is tracked in another bug, closing.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.