94955 – Uninitialized variables leads to random segfaults (valgrind log, apitrace attached)

Bug 94955 - Uninitialized variables leads to random segfaults (valgrind log, apitrace attached)

Summary: Uninitialized variables leads to random segfaults (valgrind log, apitrace att...

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Other (show other bugs)
Version:	git
Hardware:	All All

Importance:	medium major
Assignee:	mesa-dev
QA Contact:	mesa-dev

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-04-15 17:31 UTC by David Lonie
Modified:	2016-05-10 15:43 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
apitrace (10.20 MB, text/plain) 2016-04-15 17:31 UTC, David Lonie	Details
tex_sample_func (11.37 KB, text/plain) 2016-04-21 19:36 UTC, Bruce Cherniak	Details
gallivm_debug shaders (inline sampler) (354.33 KB, text/plain) 2016-04-22 00:07 UTC, Bruce Cherniak	Details
partial gallium trace (468.26 KB, text/plain) 2016-04-22 00:39 UTC, Bruce Cherniak	Details
View All

Description David Lonie 2016-04-15 17:31:52 UTC

Created attachment 122975 [details]
apitrace

There are some uninitialized variables in current master that are causing some of the VTK unit tests to segfault randomly, or just not produce any output. I've included as much information as I can think of below.

Feel free to ping me if more info is needed.

--------------------------------------------
Configure options:

    ./autogen.sh \
      --enable-debug \
      --prefix="$prefix" \
      --disable-dri \
      --disable-egl \
      --disable-gles1 \
      --disable-gles2 \
      --disable-shared-glapi \
      --enable-xlib-glx \
      --enable-gallium-osmesa \
      --with-gallium-drivers=swrast \
      --enable-gallium-llvm=yes \
        LLVM_CONFIG=/usr/bin/llvm-config-3.6 \
      --enable-llvm-shared-libs \
      --with-gl-lib-name=MesaGL \
      --with-osmesa-lib-name=MesaOSMesa

---------------------------------------------
Install with:

make install-data
cd src/gallium
make install-exec

(Regular 'make install' does this in the install dir for some reason:

$ ls mesa-master-install/lib/libMesaGL.so* -lh                                              
mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0                                       
mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0
mesa-master-install/lib/libMesaGL.so.1.5.0
mesa-master-install/lib/libMesaGL.so.1.6.0

which confuses my linker/loader ;) Another bug?)

---------------------------------------------
memcheck the attached apitrace with (obviously change paths as needed):

    MESA_GL_VERSION_OVERRIDE=4.5 \
    LD_LIBRARY_PATH=/ssd/src/llvm-3.8.0.install/lib \
    LD_PRELOAD=/ssd/src/mesa-master-install/lib/libMesaGL.so \
    valgrind --tool=memcheck \
        glretrace vtkRenderingOpenGL2CxxTests.4075.trim.trace

---------------------------------------------
Sample valgrind stacks:
==32054== Conditional jump or move depends on uninitialised value(s)
==32054==    at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58)
==32054==    by 0x5444AFE: llvmpipe_set_framebuffer_state (lp_state_surface.c:54)
==32054==    by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694)
==32054==    by 0x5356819: util_blitter_blit (u_blitter.c:1813)
==32054==    by 0x544602C: lp_blit (lp_surface.c:117)
==32054==    by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672)
==32054==    by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459)
==32054==    by 0x50B330D: copyteximage (teximage.c:3644)
==32054==    by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680)
==32054==    by 0x4D340E: ??? (in /usr/bin/glretrace)
==32054==    by 0x40CCCC: ??? (in /usr/bin/glretrace)
==32054==    by 0x40D2A7: ??? (in /usr/bin/glretrace)

==32054== Conditional jump or move depends on uninitialised value(s)
==32054==    at 0x5409DEE: lp_build_blend_factor_unswizzled (lp_bld_blend_aos.c:98)
==32054==    by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262)
==32054==    by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352)
==32054==    by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094)
==32054==    by 0x543D32D: generate_fragment (lp_state_fs.c:2434)
==32054==    by 0x543DEAC: generate_variant (lp_state_fs.c:2619)
==32054==    by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171)
==32054==    by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209)
==32054==    by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70)
==32054==    by 0x52E28C4: cso_draw_vbo (cso_context.c:1629)
==32054==    by 0x5174712: st_draw_vbo (st_draw.c:251)
==32054==    by 0x511BE24: vbo_validated_drawrangeelements (vbo_exec_array.c:844)

==32054== Conditional jump or move depends on uninitialised value(s)
==32054==    at 0x404002D: ???
==32054==    by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457)
==32054==    by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346)
==32054==    by 0x5415A81: do_rasterize_bin (lp_rast.c:609)
==32054==    by 0x5415AEF: rasterize_bin (lp_rast.c:628)
==32054==    by 0x5415BFE: rasterize_scene (lp_rast.c:688)
==32054==    by 0x5415EE3: thread_function (lp_rast.c:828)
==32054==    by 0x5413C4A: impl_thrd_routine (threads_posix.h:87)
==32054==    by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so)
==32054==    by 0x720CCBC: clone (in /usr/lib/libc-2.23.so)

Comment 1 Brian Paul 2016-04-15 17:51:55 UTC

(In reply to David Lonie from comment #0)
> Created attachment 122975 [details]
> apitrace
> 
> There are some uninitialized variables in current master that are causing
> some of the VTK unit tests to segfault randomly, or just not produce any
> output. I've included as much information as I can think of below.
> 
> Feel free to ping me if more info is needed.
> 
> --------------------------------------------
> Configure options:
> 
>     ./autogen.sh \
>       --enable-debug \
>       --prefix="$prefix" \
>       --disable-dri \
>       --disable-egl \
>       --disable-gles1 \
>       --disable-gles2 \
>       --disable-shared-glapi \
>       --enable-xlib-glx \
>       --enable-gallium-osmesa \
>       --with-gallium-drivers=swrast \
>       --enable-gallium-llvm=yes \
>         LLVM_CONFIG=/usr/bin/llvm-config-3.6 \
>       --enable-llvm-shared-libs \
>       --with-gl-lib-name=MesaGL \
>       --with-osmesa-lib-name=MesaOSMesa
> 
> ---------------------------------------------
> Install with:
> 
> make install-data
> cd src/gallium
> make install-exec
> 
> (Regular 'make install' does this in the install dir for some reason:
> 
> $ ls mesa-master-install/lib/libMesaGL.so* -lh                              
> 
> mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0                  
> 
> mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0
> mesa-master-install/lib/libMesaGL.so.1.5.0
> mesa-master-install/lib/libMesaGL.so.1.6.0
> 
> which confuses my linker/loader ;) Another bug?)
> 
> ---------------------------------------------
> memcheck the attached apitrace with (obviously change paths as needed):
> 
>     MESA_GL_VERSION_OVERRIDE=4.5 \
>     LD_LIBRARY_PATH=/ssd/src/llvm-3.8.0.install/lib \
>     LD_PRELOAD=/ssd/src/mesa-master-install/lib/libMesaGL.so \
>     valgrind --tool=memcheck \
>         glretrace vtkRenderingOpenGL2CxxTests.4075.trim.trace
> 
> ---------------------------------------------
> Sample valgrind stacks:
> ==32054== Conditional jump or move depends on uninitialised value(s)
> ==32054==    at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58)
> ==32054==    by 0x5444AFE: llvmpipe_set_framebuffer_state
> (lp_state_surface.c:54)
> ==32054==    by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694)
> ==32054==    by 0x5356819: util_blitter_blit (u_blitter.c:1813)
> ==32054==    by 0x544602C: lp_blit (lp_surface.c:117)
> ==32054==    by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672)
> ==32054==    by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459)
> ==32054==    by 0x50B330D: copyteximage (teximage.c:3644)
> ==32054==    by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680)
> ==32054==    by 0x4D340E: ??? (in /usr/bin/glretrace)
> ==32054==    by 0x40CCCC: ??? (in /usr/bin/glretrace)
> ==32054==    by 0x40D2A7: ??? (in /usr/bin/glretrace)

This one looks easy to fix.  Though, I wasn't able to reproduce the valgrind warning here with piglit's copytexsubimage test which definitely hits the same code path.


> ==32054== Conditional jump or move depends on uninitialised value(s)
> ==32054==    at 0x5409DEE: lp_build_blend_factor_unswizzled
> (lp_bld_blend_aos.c:98)
> ==32054==    by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262)
> ==32054==    by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352)
> ==32054==    by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094)
> ==32054==    by 0x543D32D: generate_fragment (lp_state_fs.c:2434)
> ==32054==    by 0x543DEAC: generate_variant (lp_state_fs.c:2619)
> ==32054==    by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171)
> ==32054==    by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209)
> ==32054==    by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70)
> ==32054==    by 0x52E28C4: cso_draw_vbo (cso_context.c:1629)
> ==32054==    by 0x5174712: st_draw_vbo (st_draw.c:251)
> ==32054==    by 0x511BE24: vbo_validated_drawrangeelements
> (vbo_exec_array.c:844)

I don't see what's wrong here.  At lp_bld_blend_aos.c:98 we're examining fields of the bld object.  But the whole bld object is initialized to zeros at line 321.


> 
> ==32054== Conditional jump or move depends on uninitialised value(s)
> ==32054==    at 0x404002D: ???
> ==32054==    by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457)
> ==32054==    by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346)
> ==32054==    by 0x5415A81: do_rasterize_bin (lp_rast.c:609)
> ==32054==    by 0x5415AEF: rasterize_bin (lp_rast.c:628)
> ==32054==    by 0x5415BFE: rasterize_scene (lp_rast.c:688)
> ==32054==    by 0x5415EE3: thread_function (lp_rast.c:828)
> ==32054==    by 0x5413C4A: impl_thrd_routine (threads_posix.h:87)
> ==32054==    by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so)
> ==32054==    by 0x720CCBC: clone (in /usr/lib/libc-2.23.so)

Not sure about this one.  I suspect our jitted code does some vector ops on some unused elements and that triggers the valgrind warning, but that should be harmless.  Roland??

Comment 2 Roland Scheidegger 2016-04-15 18:38:56 UTC

(In reply to Brian Paul from comment #1)
> (In reply to David Lonie from comment #0)
> > ==32054== Conditional jump or move depends on uninitialised value(s)
> > ==32054==    at 0x5409DEE: lp_build_blend_factor_unswizzled
> > (lp_bld_blend_aos.c:98)
> > ==32054==    by 0x540A2A5: lp_build_blend_factor (lp_bld_blend_aos.c:262)
> > ==32054==    by 0x540A53C: lp_build_blend_aos (lp_bld_blend_aos.c:352)
> > ==32054==    by 0x543C200: generate_unswizzled_blend (lp_state_fs.c:2094)
> > ==32054==    by 0x543D32D: generate_fragment (lp_state_fs.c:2434)
> > ==32054==    by 0x543DEAC: generate_variant (lp_state_fs.c:2619)
> > ==32054==    by 0x543F464: llvmpipe_update_fs (lp_state_fs.c:3171)
> > ==32054==    by 0x5435D35: llvmpipe_update_derived (lp_state_derived.c:209)
> > ==32054==    by 0x5410AC0: llvmpipe_draw_vbo (lp_draw_arrays.c:70)
> > ==32054==    by 0x52E28C4: cso_draw_vbo (cso_context.c:1629)
> > ==32054==    by 0x5174712: st_draw_vbo (st_draw.c:251)
> > ==32054==    by 0x511BE24: vbo_validated_drawrangeelements
> > (vbo_exec_array.c:844)
> 
> I don't see what's wrong here.  At lp_bld_blend_aos.c:98 we're examining
> fields of the bld object.  But the whole bld object is initialized to zeros
> at line 321.

The problem is we're later assigning it the src1_alpha value coming from the fs. And this one is only initialized if we have dual source blend.
I'll see if I can fix this easily, as the warning is annoying (I've seen it before), however it is definitely 100% harmless and will not cause any issues whatsoever (the conditional assignment may rely on uninitialized values, however the assigned value will be unused anyway).

> 
> > 
> > ==32054== Conditional jump or move depends on uninitialised value(s)
> > ==32054==    at 0x404002D: ???
> > ==32054==    by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457)
> > ==32054==    by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346)
> > ==32054==    by 0x5415A81: do_rasterize_bin (lp_rast.c:609)
> > ==32054==    by 0x5415AEF: rasterize_bin (lp_rast.c:628)
> > ==32054==    by 0x5415BFE: rasterize_scene (lp_rast.c:688)
> > ==32054==    by 0x5415EE3: thread_function (lp_rast.c:828)
> > ==32054==    by 0x5413C4A: impl_thrd_routine (threads_posix.h:87)
> > ==32054==    by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so)
> > ==32054==    by 0x720CCBC: clone (in /usr/lib/libc-2.23.so)
> 
> Not sure about this one.  I suspect our jitted code does some vector ops on
> some unused elements and that triggers the valgrind warning, but that should
> be harmless.  Roland??

This one is actually really annoying, as it turns up everywhere and thus makes valgrind output spammy. I'm quite convinced it's completely harmless, but so far I've been unable to eliminate it...

But in any case, I HIGHLY doubt these two are the reason for any random segfaults, I certainly don't see any evidence here. So, a backtrace of the actual crash would probably be more useful.

Comment 3 Emil Velikov 2016-04-16 00:17:51 UTC

(In reply to David Lonie from comment #0)

> (Regular 'make install' does this in the install dir for some reason:
> 
> $ ls mesa-master-install/lib/libMesaGL.so* -lh                              
> 
> mesa-master-install/lib/libMesaGL.so -> libMesaGL.so.1.5.0                  
> 
> mesa-master-install/lib/libMesaGL.so.1 -> libMesaGL.so.1.6.0
> mesa-master-install/lib/libMesaGL.so.1.5.0
> mesa-master-install/lib/libMesaGL.so.1.6.0
> 
> which confuses my linker/loader ;) Another bug?)
> 
Indeed it is - your colleague (?) Chuck Atkins is working on that one. See bug#94086.

Comment 4 Roland Scheidegger 2016-04-16 00:23:52 UTC

(In reply to Roland Scheidegger from comment #2)
> > > ==32054== Conditional jump or move depends on uninitialised value(s)
> > > ==32054==    at 0x404002D: ???
> > > ==32054==    by 0x5415736: lp_rast_shade_quads_mask (lp_rast.c:457)
> > > ==32054==    by 0x541999F: lp_rast_triangle_32_3_16 (lp_rast_tri.c:346)
> > > ==32054==    by 0x5415A81: do_rasterize_bin (lp_rast.c:609)
> > > ==32054==    by 0x5415AEF: rasterize_bin (lp_rast.c:628)
> > > ==32054==    by 0x5415BFE: rasterize_scene (lp_rast.c:688)
> > > ==32054==    by 0x5415EE3: thread_function (lp_rast.c:828)
> > > ==32054==    by 0x5413C4A: impl_thrd_routine (threads_posix.h:87)
> > > ==32054==    by 0x5BF6423: start_thread (in /usr/lib/libpthread-2.23.so)
> > > ==32054==    by 0x720CCBC: clone (in /usr/lib/libc-2.23.so)
> > 
> > Not sure about this one.  I suspect our jitted code does some vector ops on
> > some unused elements and that triggers the valgrind warning, but that should
> > be harmless.  Roland??
> 
> This one is actually really annoying, as it turns up everywhere and thus
> makes valgrind output spammy. I'm quite convinced it's completely harmless,
> but so far I've been unable to eliminate it...

To elaborate on that, the problem is that some of those complaints from valgrind may be valid, but others are not. And thus the output isn't all that useful. It is possible it's due to a legit bug - in which case rendering might be wrong however it still should not cause segfaults (and I didn't see that with this replay, just bogus rendering).

FWIW the trace has some issues, namely it requests a 4.5 context thus needs overrides to run. Not sure if this actually causes problems.
I actually tried to give the replay a look, but the problem is that so far I've always failed to get apitrace + valgrind + gdb to work. The problem is I can't get into gdb on errors on forked processes. valgrind --trace-children=yes works fine, but if you use that with --vgdb=yes --vgdb-error=0 what happens is this works on first (useless) process just fine. However, valgrind will ask to attach a second gdb on the fork, which immediately claims remote connection closed after the "target remote..." command. It will actually ask again after that happens, but if you retry it gdb just complains it can't access memory at some addresses and that's it - on errors valgrind just waits on that gdb connection which doesn't exist... If someone knows if and how this is somehow supposed to work that would be helpful...
I think this actually used to work with the --db-enable option but this one is gone for good now.

Comment 5 Roland Scheidegger 2016-04-16 00:38:31 UTC

(In reply to Roland Scheidegger from comment #4)
> I actually tried to give the replay a look, but the problem is that so far
> I've always failed to get apitrace + valgrind + gdb to work. The problem is
> I can't get into gdb on errors on forked processes.
Forget this it actually works. I'll see if I can find what caused this specific complaint, but I wouldn't count on it this being the real problem.

Comment 6 Roland Scheidegger 2016-04-16 05:13:18 UTC

(In reply to Roland Scheidegger from comment #5)
> (In reply to Roland Scheidegger from comment #4)
> > I actually tried to give the replay a look, but the problem is that so far
> > I've always failed to get apitrace + valgrind + gdb to work. The problem is
> > I can't get into gdb on errors on forked processes.
> Forget this it actually works. I'll see if I can find what caused this
> specific complaint, but I wouldn't count on it this being the real problem.

So, I've tracked this down a bit - the jne instruction where valgrind complains is actually the result of a ptest instruction. This is due to lp_build_mask_check() which tests if not all pixels have been killed (if they have, it would skip execution of the rest of the shader).
This "pixel alive" mask is always initialized fully, so initially not undefined (the plane evaluation generated this essentially figuring out which pixels in a 2x2 pixel quad belong to the current triangle).
BUT, the shader does something along the lines of:
IF whatever
   temp0.x = something
ENDIF
KILL_IF temp0.x

Interestingly, we actually do zero-intialize all temp regs, thus things still have to be defined, even if that temp0.x assignment didn't assign all channels according to the exec mask. The kill_if will then combine the pixel alive mask with the mask generated from temp0.x < 0 comparison, then do the lp_build_mask_check() using this mask which triggers the warning.

Ultimately, I do not really know why valgrind thinks it's undefined. However, to my knowledge valgrind can track definedness of values at a bit level (which is quite an amazing feat indeed), but NOT for sse regs. There's some warnings it might be less accurate there.
(FWIW --track-origins=yes simply points to lp_rast_arg_triangle_contained (lp_rast.h:220))
Theoretically, of course it is also possible the undefinedness directly comes from a buggy application itself (think something simple like glUniform call using an undefined value in the first place).
But I've seen this warning so many times and everything still worked that I do not believe valgrind complaints which are directed toward mask checks point to any real problem (and I have doubts they are even valid).

So, I'm going to mark this bug as fixed. Two minor issues have been addressed (well actually the fb one could have real consequences resulting in unnecessary state updates), and I'm not seeing any random segfaults in any case. Even if valgrind is right about the uninitialized values in the jit code that won't cause crashes (valgrind would say invalid read/write for anything which could crash). Feel free to open a new bug if you see crashes (preferably with a backtrace) or misrenderings.

Comment 7 David Lonie 2016-04-18 17:56:12 UTC

> Comment # 1 on bug 94955 from Brian Paul
> (In reply to David Lonie from comment #0)
> > ==32054== Conditional jump or move depends on uninitialised value(s)
> > ==32054==    at 0x5367CF7: util_framebuffer_state_equal (u_framebuffer.c:58)
> > ==32054==    by 0x5444AFE: llvmpipe_set_framebuffer_state
> > (lp_state_surface.c:54)
> > ==32054==    by 0x53561DA: util_blitter_blit_generic (u_blitter.c:1694)
> > ==32054==    by 0x5356819: util_blitter_blit (u_blitter.c:1813)
> > ==32054==    by 0x544602C: lp_blit (lp_surface.c:117)
> > ==32054==    by 0x51705F7: st_CopyTexSubImage (st_cb_texture.c:2672)
> > ==32054==    by 0x50B2B03: copytexsubimage_by_slice (teximage.c:3459)
> > ==32054==    by 0x50B330D: copyteximage (teximage.c:3644)
> > ==32054==    by 0x50B3476: _mesa_CopyTexImage2D (teximage.c:3680)
> > ==32054==    by 0x4D340E: ??? (in /usr/bin/glretrace)
> > ==32054==    by 0x40CCCC: ??? (in /usr/bin/glretrace)
> > ==32054==    by 0x40D2A7: ??? (in /usr/bin/glretrace)
>
> This one looks easy to fix.  Though, I wasn't able to reproduce the valgrind
> warning here with piglit's copytexsubimage test which definitely hits the same
> code path.

I've poked at it, and this patch seems to do the trick for me:

--- a/src/gallium/auxiliary/util/u_blitter.c
+++ b/src/gallium/auxiliary/util/u_blitter.c
@@ -1573,6 +1573,8 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
    fb_state.nr_cbufs = blit_depth || blit_stencil ? 0 : 1;
    fb_state.cbufs[0] = NULL;
    fb_state.zsbuf = NULL;
+   fb_state.samples = 0;
+   fb_state.layers = 0;
 
    if (blit_depth || blit_stencil) {
       pipe->bind_blend_state(pipe, ctx->blend[0][0]);

> Comment # 2 on bug 94955 from Roland Scheidegger
> (In reply to Brian Paul from comment #1)
> > (In reply to David Lonie from comment #0)
> But in any case, I HIGHLY doubt these two are the reason for any random
> segfaults, I certainly don't see any evidence here. So, a backtrace of the
> actual crash would probably be more useful.

Glad the other two seem harmless. If these aren't likely to cause a segfault I'll keep poking around. The backtrace is difficult to obtain because by the time it crashes the stack is corrupt and the resulting backtrace is meaningless.

Since the issue is a stack corruption, it makes it difficult to step through, since the crashes happen somewhat randomly, and when it does all debugger state is lost since the stack is nonsense. Hence why I thought the memory errors would be at fault, but it looks like this is going to be trickier and more subtle than it seemed.

I'll keep looking and update when I find something else.

> Comment # 3 on bug 94955 from Emil Velikov
> (In reply to David Lonie from comment #0)
> > which confuses my linker/loader ;) Another bug?)
> > 
> Indeed it is - your colleague (?) Chuck Atkins is working on that one. See
> bug#94086.

So he is! Glad it's a known issue, that one took some work to track down ;)

> Comment # 4 on bug 94955 from Roland Scheidegger
> FWIW the trace has some issues, namely it requests a 4.5 context thus needs
> overrides to run. Not sure if this actually causes problems.

The 4.5 override was needed for my apitrace replay context. The actual segfaults are happening on a machine using a different (I believe 3.2?) context.

Is there information around that details how to get a better apitrace for you folks? I have another segfaulting test I could capture for you.

> Comment # 6 on bug 94955 from Roland Scheidegger
> So, I'm going to mark this bug as fixed. Two minor issues have been addressed
> (well actually the fb one could have real consequences resulting in unnecessary
> state updates), and I'm not seeing any random segfaults in any case. Even if
> valgrind is right about the uninitialized values in the jit code that won't
> cause crashes (valgrind would say invalid read/write for anything which could
> crash). Feel free to open a new bug if you see crashes (preferably with a
> backtrace) or misrenderings.

Sounds good to me. Thanks so much for the fast replies and thorough checking of the valgrind reports, even if they turned out to be bogus!

Comment 8 Roland Scheidegger 2016-04-18 18:30:49 UTC

(In reply to David Lonie from comment #7)
> > Comment # 4 on bug 94955 from Roland Scheidegger
> > FWIW the trace has some issues, namely it requests a 4.5 context thus needs
> > overrides to run. Not sure if this actually causes problems.
> 
> The 4.5 override was needed for my apitrace replay context. The actual
> segfaults are happening on a machine using a different (I believe 3.2?)
> context.
> 
> Is there information around that details how to get a better apitrace for
> you folks? I have another segfaulting test I could capture for you.
> 
The problem is, if the trace requires version override, it is difficult to tell if it actually isn't using any functions which are actually known to not really work.

Comment 9 David Lonie 2016-04-18 18:39:45 UTC

Reopening -- I may have found it. This other failing test looks better for tracking this down. It segfaults reproducibly with this stack trace from valgrind:

==19776==  Address 0x207abe3c is not stack'd, malloc'd or (recently) free'd
==19776== 
==19776== 
==19776== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==19776==  Access not within mapped region at address 0x207ABE3C
==19776==    at 0x4039042: ???
==19776==    by 0x113B377A: lp_rast_shade_quads_mask (lp_rast.c:457)
==19776==    by 0x113B8E5A: do_block_4_2 (lp_rast_tri_tmp.h:67)
==19776==    by 0x113B9194: do_block_16_2 (lp_rast_tri_tmp.h:152)
==19776==    by 0x113B96DC: lp_rast_triangle_2 (lp_rast_tri_tmp.h:305)
==19776==    by 0x113B3AC5: do_rasterize_bin (lp_rast.c:609)
==19776==    by 0x113B3B33: rasterize_bin (lp_rast.c:628)
==19776==    by 0x113B3C42: rasterize_scene (lp_rast.c:688)
==19776==    by 0x113B3F27: thread_function (lp_rast.c:828)
==19776==    by 0x113B1C8E: impl_thrd_routine (threads_posix.h:87)
==19776==    by 0x11C7A423: start_thread (in /usr/lib/libpthread-2.23.so)
==19776==    by 0x4F1DCBC: clone (in /usr/lib/libc-2.23.so)

This looks similar to the third trace in my original bug, but it's an access violation instead of an uninitialized value, and some of the frames are different.

The core dump valgrind generates is corrupted, unfortunately:

(gdb) bt
#0  0x0000000004039042 in ?? ()
#1  0x3ebb67ae3ebb67ae in ?? ()
#2  0x3ebb67ae3ebb67ae in ?? ()
#3  0x3ebb67ae3ebb67ae in ?? ()
#4  0x3ebb67ae3ebb67ae in ?? ()
#5  0x0000000000000000 in ?? ()

Running in gdb gives a similar backtrace. Tried getting an apitrace, but the segfault is preventing that from producing anything meaningful, too.

Hopefully that stack will be enough...I wish I could get you guys a useful apitrace or core dump to inspect, but this crash is doing a good job of covering its tracks! Best I can do is provide instructions for running the VTK test that reproduces it:

1) git clone https://gitlab.kitware.com/vtk/vtk.git
2) mkdir vtk-build
3) cd vtk-build
4) cmake ../vtk \
     -DOPENGL_INCLUDE_DIR=/path/to/mesa/install/prefix/include \
     -DOPENGL_gl_LIBRARY=/path/to/mesa/install/prefix/lib/libMesaGL.so \
     -DOPENGL_glu_LIBRARY=""
5) make

To run the test, either:

ctest -R TestTextureRGBADepthPeeling

or (to get around the ctest launcher for gdb/valgrind):

bin/vtkRenderingCoreCxxTests "TestTextureRGBADepthPeeling" "-D" "ExternalData/Testing" "-T" "Testing/Temporary" "-V" "ExternalData/Rendering/Core/Testing/Data/Baseline/TestTextureRGBADepthPeeling.png

Comment 10 Bruce Cherniak 2016-04-21 19:32:56 UTC

Does the gallium llvm sampler fully support float textures?  Both llvmpipe and OpenSWR generate the same sampler and fail identically.

I'll attach the sampler ir, if it's useful.  Any hints on debugging sampler jit are appreciated. :-)

Comment 11 Bruce Cherniak 2016-04-21 19:36:31 UTC

Created attachment 123134 [details]
tex_sample_func

Here's the ir for the jit'd sampler.  I forced "use_tex_func" so it didn't inline with the FS, although it fails the same either way.

Comment 12 Roland Scheidegger 2016-04-21 20:20:19 UTC

(In reply to Bruce Cherniak from comment #10)
> Does the gallium llvm sampler fully support float textures?  Both llvmpipe
> and OpenSWR generate the same sampler and fail identically.
Yes, that's easily supported (it's actually one of the most easy formats code-wise). The sampler in question doesn't even use any filtering, so compared to some other generated sampler funcs it's dead simple.
Forcing tex func is actually a really good idea to separate things out (even more so when looking at the assembly), albeit with a float type it should not have been necessary (the logic there what's complex or not is quite flawed actually...)

Though it doesn't look like float format to me. More like D24X8.

> 
> I'll attach the sampler ir, if it's useful.  Any hints on debugging sampler
> jit are appreciated. :-)
Any evidence it actually crashed there? Kind of difficult to debug if you don't get a valid address...
My guess would be it's sampling the system (so not fbo) depth buffer, and the buffer disappeared in the meantime (if it really is crashing there)...
In that case, it might be similar to bug 94522.

Comment 13 Roland Scheidegger 2016-04-21 23:25:29 UTC

Actually it is crashing in sampling, but it's a different sampler. Looks like a TXF, there _might_ be a problem with mip level calculation as it hits a seldomly used path...

Comment 14 Bruce Cherniak 2016-04-22 00:07:34 UTC

Created attachment 123136 [details]
gallivm_debug shaders (inline sampler)

I got distracted trying to get tex func to dump disassembly. :-$
Seems that in gallivm_compile_module LLVMGetPointerToGlobal returns func_code = 0 (lp_bld_init:621) and causes disassembly to crash.  But, that's another problem.  So, I switched back to an inline sampler (use_tex_func = 0).

All of the shader dump (tgsi,ir,asm) are attached.  The faulting address is within the sampler portion of the fragment shader.

(gdb) bt 2
#0  0x00007fffe9acec71 in FS ()
#1  0x00007fffe9c74bdb in BackendSingleSample<0u, 1u, 1u, 0u, 0u> (pDC=0x710780,
 workerId=0, x=56, y=56, work=..., renderBuffers=...) at ../../../../../src/gallium/drivers/swr/rasterizer/core/backend.cpp:821

(disassembly around $pc)
   0x00007fffe9acec4a <+3146>:  or     %eax,%esp
   0x00007fffe9acec4c <+3148>:  sarl   $0xc1,-0x25(%rcx)
   0x00007fffe9acec50 <+3152>:  movabs $0x7fffe9acd080,%rcx
   0x00007fffe9acec5a <+3162>:  vmovdqa (%rcx),%xmm12
   0x00007fffe9acec5e <+3166>:  vpshufb %xmm12,%xmm0,%xmm3
   0x00007fffe9acec63 <+3171>:  movslq %ebx,%r10
   0x00007fffe9acec66 <+3174>:  sar    $0x20,%rbx
   0x00007fffe9acec6a <+3178>:  movslq %edi,%rcx
   0x00007fffe9acec6d <+3181>:  sar    $0x20,%rdi
=> 0x00007fffe9acec71 <+3185>:  vmovd  (%r10,%rax,1),%xmm0
   0x00007fffe9acec77 <+3191>:  vpinsrd $0x1,(%rbx,%rax,1),%xmm0,%xmm0
   0x00007fffe9acec7e <+3198>:  vpinsrd $0x2,(%rcx,%rax,1),%xmm0,%xmm0
   0x00007fffe9acec85 <+3205>:  vpinsrd $0x3,(%rdi,%rax,1),%xmm0,%xmm2
   0x00007fffe9acec8c <+3212>:  movslq %edx,%rcx
   0x00007fffe9acec8f <+3215>:  sar    $0x20,%rdx
   0x00007fffe9acec93 <+3219>:  movslq %esi,%rdi
   0x00007fffe9acec96 <+3222>:  sar    $0x20,%rsi
   0x00007fffe9acec9a <+3226>:  vmovd  (%rcx,%rax,1),%xmm0
   0x00007fffe9acec9f <+3231>:  vpinsrd $0x1,(%rdx,%rax,1),%xmm0,%xmm0
   0x00007fffe9aceca6 <+3238>:  vpinsrd $0x2,(%rdi,%rax,1),%xmm0,%xmm0
   0x00007fffe9acecad <+3245>:  vpinsrd $0x3,(%rsi,%rax,1),%xmm0,%xmm1
   0x00007fffe9acecb4 <+3252>:  vpmovzxbw %xmm2,%xmm6
   0x00007fffe9acecb9 <+3257>:  vpmovzxbw %xmm1,%xmm14
   0x00007fffe9acecbe <+3262>:  vpsubw %xmm6,%xmm14,%xmm14
   0x00007fffe9acecc2 <+3266>:  vpmovzxbw %xmm3,%xmm0
   0x00007fffe9acecc7 <+3271>:  vpmullw %xmm0,%xmm14,%xmm7
   0x00007fffe9aceccb <+3275>:  vpsrlw $0x8,%xmm7,%xmm7
   0x00007fffe9acecd0 <+3280>:  vpaddw %xmm6,%xmm7,%xmm6
   0x00007fffe9acecd4 <+3284>:  vxorps %xmm5,%xmm5,%xmm5
   0x00007fffe9acecd8 <+3288>:  vpunpckhbw %xmm5,%xmm3,%xmm7
   0x00007fffe9acecdc <+3292>:  vpunpckhbw %xmm5,%xmm2,%xmm2
   0x00007fffe9acece0 <+3296>:  vpunpckhbw %xmm5,%xmm1,%xmm1
   0x00007fffe9acece4 <+3300>:  vxorps %xmm10,%xmm10,%xmm10

(gdb) p/x $r10
$2 = 0x20023fc
(gdb) p/x $rax
$3 = 0x364f6c0
(gdb) p/x $xmm0
$4 = {v4_float = {0x0, 0x0, 0x0, 0x0},
      v2_double = {0x0, 0x0},
      v16_int8 = {0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0, 0x80, 0x0, 0x0, 0x0},
      v8_int16 = {0x80, 0x0, 0x80, 0x0, 0x80, 0x0, 0x80, 0x0},
      v4_int32 = {0x80, 0x80, 0x80, 0x80},
      v2_int64 = {0x8000000080, 0x8000000080},
      uint128 = 0x0000008000000 0800000008000000080}

(gdb) i r
rax            0x364f6c0 	56948416
rbx            0x20023fc 	33563644
rcx            0x20023fc 	33563644
rdx            0x200240002002400        144154770528019456
rsi            0x200240002002400        144154770528019456
rdi            0x20023fc        33563644
rbp            0x7fffffffa010   0x7fffffffa010
rsp            0x7fffffff9d20   0x7fffffff9d20
r8             0x1fffffc01fffffc        144115170929541116
r9             0x200000002000000        144115188109410304
r10            0x20023fc 	33563644
r11            0x1fffffc01fffffc        144115170929541116
r12            0x0      0
r13            0x0	0
r14            0x200000002000000        144115188109410304
r15            0x7fffffffa2a0   140737488331424
rip            0x7fffe9acec71   0x7fffe9acec71 <FS+3185>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0	0
es             0x0      0
fs             0x0	0
gs             0x0      0

Comment 15 Bruce Cherniak 2016-04-22 00:39:21 UTC

Created attachment 123137 [details]
partial gallium trace

Attaching the gallium trace (at least up until the crash).  I'm still looking through it, but the 2 resources being sampled are:

3948 pipe_context::create_sampler_view(pipe = 0x01073190, resource = 0x03fe46b0, templ = {format = PIPE_FORMAT_R8G8B8A8_UNORM, u = {tex = {first_layer = 0, last_layer = 0, first_level = 0, last_level = 0}}, swizzle_r = 0, swizzle_g = 1, swizzle_b = 2, swizzle_a = 3}) = 0x03f72690 // time 9
3949 pipe_context::create_sampler_view(pipe = 0x01073190, resource = 0x02be7930, templ = {format = PIPE_FORMAT_R32G32_FLOAT, u = {tex = {first_layer = 0, last_layer = 0, first_level = 0, last_level = 0}}, swizzle_r = 0, swizzle_g = 1, swizzle_b = 4, swizzle_a = 3}) = 0x03fcb740 // time 7

Comment 16 Bruce Cherniak 2016-04-22 00:56:40 UTC

BTW, for my edification, what indicates it might be a bug in the mip level calculation?

Comment 17 Roland Scheidegger 2016-04-22 01:01:28 UTC

(In reply to Bruce Cherniak from comment #11)
> Created attachment 123134 [details]
> tex_sample_func
> 
> Here's the ir for the jit'd sampler.  I forced "use_tex_func" so it didn't
> inline with the FS, although it fails the same either way.

Actually for me the sampler which crashes is a bog-standard rgba8 one, with linear filtering, repeat wrap, npot.
I think I found the problem, the coords are all nans (might just be the shader doing this), and the wrap code isn't quite robust against that. Nasty.

Comment 18 Roland Scheidegger 2016-04-22 01:06:50 UTC

(In reply to Bruce Cherniak from comment #16)
> BTW, for my edification, what indicates it might be a bug in the mip level
> calculation?

That was just a guess as it hit the per-pixel mip computations, which was odd - in fact the texture was explicitly marked as having one level only, so I figured maybe there's something wrong with that logic there, as there was plenty of unnecessary instructions (8 shifts, 8 max, some more - despite that we actually know the result of all this has to be zero anyway). But it's kinda expected actually so was correct, albeit some day might want to improve that (and was in different sampling code, I totally missed that...).

Comment 19 Bruce Cherniak 2016-04-22 19:22:36 UTC

Thanks Roland!  You patch fixes the segv I get when running both llvmpipe and OpenSWR; and the test itself now passes on llvmpipe.

Comment 20 Roland Scheidegger 2016-04-22 20:18:05 UTC

(In reply to Bruce Cherniak from comment #19)
> Thanks Roland!  You patch fixes the segv I get when running both llvmpipe
> and OpenSWR; and the test itself now passes on llvmpipe.

I wonder what the test expects, I haven't seen anything but NaN coords with that sampler ;-). That can't have possibly given really useful results, unless not crashing is good enough to pass :-).

Comment 21 David Lonie 2016-04-26 15:11:30 UTC

Thanks for the fix! 

> Comment # 20 on bug 94955 from Roland Scheidegger
> (In reply to Bruce Cherniak from comment #19)
> > Thanks Roland!  You patch fixes the segv I get when running both llvmpipe
> > and OpenSWR; and the test itself now passes on llvmpipe.

The segfault is indeed gone, so that's a definite improvement!

> I wonder what the test expects, I haven't seen anything but NaN coords with
> that sampler ;-). That can't have possibly given really useful results, unless
> not crashing is good enough to pass :-).

The test is not passing, but it is rendering a portion of the scene (the part that doesn't use floating point textures).

What's interesting is that mesa is reporting ARB_texture_float, even though I've not configured with --enable-texture-float, so that's likely part of the problem. I also see error messages popping out from mesa at runtime:

Illegal sampler view creation without bind flag

Recompiling with --enable-texture-float fixed neither the rendering or the error messages, so it appears something else is still going on.

Should this be a new bug? Is there more information I can provide? This can be reproduced in VTK as described in comment #9.

Comment 22 Roland Scheidegger 2016-04-26 18:10:41 UTC

(In reply to David Lonie from comment #21)
> > I wonder what the test expects, I haven't seen anything but NaN coords with
> > that sampler ;-). That can't have possibly given really useful results, unless
> > not crashing is good enough to pass :-).
> 
> The test is not passing, but it is rendering a portion of the scene (the
> part that doesn't use floating point textures).
Ok. I guess the all nans there as coords have something to do with it...

> 
> Illegal sampler view creation without bind flag
This is "typically" harmless for now at least (especially on sw renderers), but an abuse of bind flags (though I don't know about swr, if you make use of the flags to track potential changes of the contents of resources). If a resource is created, it must have sampler view bind flag, otherwise it's illegal to later create a sampler view. The problem is, this requirement doesn't fit well with GL. Sometimes for things as blitting such resources also util code may create sampler views out of them. If you're interested it should be pretty easy to figure out where the result was created, why it doesn't have the sampler view bind flag and why later a sampler view is created regardless, but it most likely isn't the source of the issue.

> Should this be a new bug? Is there more information I can provide? This can
> be reproduced in VTK as described in comment #9.
Yes this should be a new bug (even the ones fixed actually have nothing to do with the bug title already, so let's not make this worse).
You probably need to track down what's going wrong. All I can say from when I was looking at the NaN coords, that shader seemed to source the coords directly from an input attribute, and the vs looked fairly simple, so I don't know what happened really. However, the test seemed to possibly use some kind of multi-pass operation, hence it's possible previous shaders didn't output the right result already.

Comment 23 David Lonie 2016-05-04 17:58:28 UTC

Opened a new bug:

https://bugs.freedesktop.org/show_bug.cgi?id=95266

Feel free to close this one.

Comment 24 Roland Scheidegger 2016-05-10 15:43:28 UTC

The sampler crashing due to NaN coords has been fixed by bd07e20d208268382a34dca23ff71a8192bb1525 and the test failing (presumably the reason is the coords shouldn't be NaNs) is tracked in another bug, closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.