Bug 95026

Summary: Alien Isolation segfault after initial loading screen/video
Product: Mesa
Reporter: Christoph Haag <haagch>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
Attachments: layout asm, segfault is in the first line
MESA_GLSL=dump
MESA_GLSL=dump with --enable-debug
Separate non-recursive part out of visit(ir_expression) to reduce stack explosion

Description Christoph Haag 2016-04-19 20:15:23 UTC
Created attachment 123069 [details]
layout asm, segfault is in the first line

I've been seeing this for a while when playing around with the compute shader patches, but I do not know whether they are at fault.

Running on
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M] (rev ff)
with mesa git and very recent llvm 3.9.

On the first start of Alien Isolation the game shows the initial loading screen and then the 20th century logo video - while playing this video, Alien Isolation segfaults. With subsequent starts of the game, it segfaults while displaying the loading screen and before getting to the 20th century logo video.

Unfortunately I can only provide this backtrace:

Thread 31 "WinMain" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc816d700 (LWP 18766)]
0x00007fffe98a4387 in ?? () from /usr/lib/xorg/modules/dri/radeonsi_dri.so
(gdb) bt
#0  0x00007fffe98a4387 in ?? () from /usr/lib/xorg/modules/dri/radeonsi_dri.so
#1  0x0000000000000000 in ?? ()
(gdb)

because when I compile mesa with empty CFLAGS, it segfaults, but when mesa is compiled with CFLAGS="-g", then the segfault does NOT happen and the game works nicely. So there's only the assembly gdb shows with layout asm in the attachment...

When I use the same mesa build that crashes on radeonsi with intel graphics, it also does not crash, so it's something in the radeonsi driver.
Comment 1 Christoph Haag 2016-04-19 20:52:10 UTC
Aww I got it wrong. When CFLAGS is not set, mesa is compiled with this:
CFLAGS:          -g -O2 -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp
CXXFLAGS:        -g -O2 -Wall -fno-strict-aliasing -fno-builtin-memcmp

And it still segfaults, but with -g I can of course get a backtrace:

Thread 34 "WinMain" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc8553700 (LWP 2682)]
0x00007fffe985e557 in glsl_to_tgsi_visitor::visit (this=0x7fff3eddace0, ir=0x7fff3eebc7e8) at state_tracker/st_glsl_to_tgsi.cpp:1537
1537    {
(gdb) set logging on
Copying output to gdb.txt.
(gdb) bt full
#0  0x00007fffe985e557 in glsl_to_tgsi_visitor::visit (this=0x7fff3eddace0, ir=0x7fff3eebc7e8) at state_tracker/st_glsl_to_tgsi.cpp:1537
        operand = <optimized out>
        op = {{file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0,
            is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false,
            double_reg2 = false, array_id = 0, is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0,
            reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0, is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0,
            negate = 0, type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0, is_double_vertex_input = false}}
        result_src = <optimized out>
        result_dst = <optimized out>
        vector_elements = <optimized out>
#1  0x0000000000000000 in ?? ()
No symbol table info available.

I'll keep playing around to find out how it did actually work. Maybe it was with CFLAGS not unset, but set to ""?
Comment 2 Ilia Mirkin 2016-04-19 21:00:01 UTC
(In reply to Christoph Haag from comment #1)
> Thread 34 "WinMain" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffc8553700 (LWP 2682)]
> 0x00007fffe985e557 in glsl_to_tgsi_visitor::visit (this=0x7fff3eddace0,
> ir=0x7fff3eebc7e8) at state_tracker/st_glsl_to_tgsi.cpp:1537

Try to get a dump of the shader it's trying to convert. MESA_GLSL=dump should do it, I think. If not, try mucking around in get_mesa_program to print it before it visits the IR.
Comment 3 Christoph Haag 2016-04-19 21:27:10 UTC
Created attachment 123071 [details]
MESA_GLSL=dump

I'll be damned. Now I compiled with CFLAGS="" CXXFLAGS="" and configure said it's this:

CFLAGS:          -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp
CXXFLAGS:        -Wall -fno-strict-aliasing -fno-builtin-memcmp

and the game works fine, I started it 4 times in a row, so I am relatively sure it's not just random.
But I thought it was crashing earlier with this config.. I must really be remembering this wrong.

So a first hypothesis would probably be O2.

MESA_GLSL=dump is *really* slow, by the way. It took 9:06 minutes until it crashed, but attached is the complete log with stdout and stderr, compressed because it is 89 megabyte uncompressed...
Comment 4 Ilia Mirkin 2016-04-19 21:37:39 UTC
(In reply to Christoph Haag from comment #3)
> Created attachment 123071 [details]
> MESA_GLSL=dump
> 
> I'll be damned. Now I compiled with CFLAGS="" CXXFLAGS="" and configure said
> it's this:
> 
> CFLAGS:          -Wall -std=c99 -Werror=implicit-function-declaration
> -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno
> -fno-trapping-math -fno-builtin-memcmp
> CXXFLAGS:        -Wall -fno-strict-aliasing -fno-builtin-memcmp
> 
> and the game works fine, I started it 4 times in a row, so I am relatively
> sure it's not just random.
> But I thought it was crashing earlier with this config.. I must really be
> remembering this wrong.
> 
> So a first hypothesis would probably be O2.
> 
> MESA_GLSL=dump is *really* slow, by the way. It took 9:06 minutes until it
> crashed, but attached is the complete log with stdout and stderr, compressed
> because it is 89 megabyte uncompressed...

Well, I really only need the last shader (the one that causes the crash). Unfortunately, it seems like that information is not there. I guess it's only printed when built with --enable-debug? Can you see if that still causes crashes?
Comment 5 Christoph Haag 2016-04-19 22:02:52 UTC
Created attachment 123072 [details]
MESA_GLSL=dump with --enable-debug

With CFLAGS and CXXFLAGS not set and --enable-debug, mesa is using this:
CFLAGS:          -g -O2 -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp
CXXFLAGS:        -g -O2 -Wall -fno-strict-aliasing -fno-builtin-memcmp

And the game crashes. Looks like +1 for O2.
Comment 6 Christoph Haag 2016-04-19 22:31:51 UTC
So I went ahead and tested my theory by setting CFLAGS and CXXFLAGS to "-O2" and then to "-O1".

First:
CFLAGS:          -O2 -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp -g
CXXFLAGS:        -O2 -Wall -fno-strict-aliasing -fno-builtin-memcmp -g
-> game crashes

Then:
CFLAGS:          -O1 -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp
CXXFLAGS:        -O1 -Wall -fno-strict-aliasing -fno-builtin-memcmp
-> game does NOT crash

That's with gcc (GCC) 5.3.0 on archlinux.
Comment 7 Ilia Mirkin 2016-04-19 22:34:06 UTC
(In reply to Christoph Haag from comment #5)
> Created attachment 123072 [details]
> MESA_GLSL=dump with --enable-debug
> 
> With CFLAGS and CXXFLAGS not set and --enable-debug, mesa is using this:
> CFLAGS:          -g -O2 -Wall -std=c99 -Werror=implicit-function-declaration
> -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno
> -fno-trapping-math -fno-builtin-memcmp
> CXXFLAGS:        -g -O2 -Wall -fno-strict-aliasing -fno-builtin-memcmp
> 
> And the game crashes. Looks like +1 for O2.

Can you do "i threads" when it crashes? Are any other threads compiling shaders at the same time?

It appears that the last linked fragment shader is from fragment shader 2749, which is not the last glsl shader source to have been provided. Which is odd, but not impossible. [Not 100% sure, but I think it's linked with vertex shader 2693.]

Need to run them through piglit and see what happens. However, they look simple enough. Could you poke around a bit more in gdb and see if there's anything interesting?
Comment 8 Christoph Haag 2016-04-19 22:52:38 UTC
Looks like it's the only one:
Thread 32 "WinMain" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc84b1700 (LWP 11301)]
0x00007fffe979cb77 in glsl_to_tgsi_visitor::visit (this=0x7fff3ea10aa0, ir=0x7fff3e263c40) at state_tracker/st_glsl_to_tgsi.cpp:1537
warning: Source file is more recent than executable.
1537    {
(gdb) i threads
  Id   Target Id         Frame
  1    Thread 0x7ffff7fdaec0 (LWP 11258) "AlienIsolation" 0x00007ffff6ef977d in nanosleep () from /usr/lib/libpthread.so.0
  2    Thread 0x7fffee897700 (LWP 11262) "SDLTimer" 0x00007ffff6ef85f5 in do_futex_wait () from /usr/lib/libpthread.so.0
  3    Thread 0x7fffe5a64700 (LWP 11263) "AlienIsolation" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  4    Thread 0x7fffdf705700 (LWP 11264) "PulseHotplug" 0x00007ffff536ad01 in ppoll () from /usr/lib/libc.so.6
  5    Thread 0x7fffdd8c9700 (LWP 11265) "CFileWriterThre" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  6    Thread 0x7fffdd35b700 (LWP 11266) "OpenGL dispatch" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  7    Thread 0x7fffdcb5a700 (LWP 11267) "AlienIsolation" 0x00007ffff534361d in nanosleep () from /usr/lib/libc.so.6
  8    Thread 0x7fffd7fff700 (LWP 11268) "OpenGL dispatch" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  9    Thread 0x7fffd77fe700 (LWP 11273) "WinMain" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  10   Thread 0x7fffc92f6700 (LWP 11279) "OpenGL dispatch" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  11   Thread 0x7fffc8af5700 (LWP 11280) "WinMain" 0x00007ffff6ef63e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  12   Thread 0x7fffd415a700 (LWP 11281) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  13   Thread 0x7fffd4109700 (LWP 11282) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  14   Thread 0x7fffd40b8700 (LWP 11283) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  15   Thread 0x7fff957fc700 (LWP 11284) "SDLAudioDev2" 0x00007ffff536ad01 in ppoll () from /usr/lib/libc.so.6
  16   Thread 0x7fffd4067700 (LWP 11285) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  17   Thread 0x7fffc86b4700 (LWP 11286) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  18   Thread 0x7fffc8663700 (LWP 11287) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  31   Thread 0x7fffc8612700 (LWP 11300) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
* 32   Thread 0x7fffc84b1700 (LWP 11301) "WinMain" 0x00007fffe979cb77 in glsl_to_tgsi_visitor::visit (this=0x7fff3ea10aa0, ir=0x7fff3e263c40) at state_tracker/st_glsl_to_tgsi.cpp:1537
  33   Thread 0x7fffc8469700 (LWP 11302) "WinMain" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  34   Thread 0x7fffc8408700 (LWP 11303) "WinMain" 0x00007ffff6ef8427 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
  36   Thread 0x7fffc835b700 (LWP 11305) "WinMain" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  38   Thread 0x7fffc830a700 (LWP 11307) "WinMain" 0x00007ffff534361d in nanosleep () from /usr/lib/libc.so.6
  39   Thread 0x7fff7bffe700 (LWP 11308) "vpx decode" 0x00007ffff6ef603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0


I'm not exactly an expert at debugging C, but it's kinda weird that it's segfaulting on the opening brace of

void
glsl_to_tgsi_visitor::visit(ir_expression *ir)
{
   unsigned int operand;

Googling a bit sounds like it's a stack overflow: http://stackoverflow.com/a/10501490
With "disas" gdb actually allows me to see the assembler instructions before the crash. Yay.

   0x00007fffe979cb70 <+32>:    sub    $0x1000,%rsp
=> 0x00007fffe979cb77 <+39>:    orq    $0x0,(%rsp)

but not sure why:

(gdb) print ir->operands
$1 = {0x7fff3eb22770, 0x0, 0x0, 0x0}

(gdb) info locals
operand = <optimized out>
op = {{file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0,
    is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false,
    double_reg2 = false, array_id = 0, is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0, type = 0, reladdr = 0x0,
    reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0, is_double_vertex_input = false}, {file = PROGRAM_TEMPORARY, index = 0, index2D = 0, swizzle = 0, negate = 0,
    type = 0, reladdr = 0x0, reladdr2 = 0x0, has_index2 = false, double_reg2 = false, array_id = 0, is_double_vertex_input = false}}
result_src = <optimized out>
result_dst = <optimized out>
__PRETTY_FUNCTION__ = "virtual void glsl_to_tgsi_visitor::visit(ir_expression*)"
vector_elements = <optimized out>
Comment 9 Ilia Mirkin 2016-04-19 23:00:41 UTC
(In reply to Christoph Haag from comment #8)
> Looks like it's the only one:

Indeed it is.

> I'm not exactly an expert at debugging C, but it's kinda weird that it's
> segfaulting on the opening brace of
> 
> void
> glsl_to_tgsi_visitor::visit(ir_expression *ir)
> {
>    unsigned int operand;
> 

Very weird. But weird things happen with -O2

> Googling a bit sounds like it's a stack overflow:
> http://stackoverflow.com/a/10501490
> With "disas" gdb actually allows me to see the assembler instructions before
> the crash. Yay.
> 
>    0x00007fffe979cb70 <+32>:    sub    $0x1000,%rsp
> => 0x00007fffe979cb77 <+39>:    orq    $0x0,(%rsp)
> 
> but not sure why:

Right, so this allocates 4K of stack, which isn't some incredibly large amount. Normally stack just gets auto-paged in, perhaps memory is running out... somehow. I don't really know enough about this :(

What is the value of %rsp? (i registers)

What compiler are you building with?
Comment 10 Christoph Haag 2016-04-19 23:01:48 UTC
As I said, gcc 5.3 on archlinux.

(gdb) i registers
rax            0x7fffe979cb50   140737110461264
rbx            0x7fff3ea10aa0   140734244129440
rcx            0x7fffe986cac0   140737111313088
rdx            0x7fff3ea10aa0   140734244129440
rsi            0x7fff3e263c40   140734236081216
rdi            0x7fff3ea10aa0   140734244129440
rbp            0x7fff3db85440   0x7fff3db85440
rsp            0x7fffc846af88   0x7fffc846af88
r8             0x1404   5124
r9             0x7fff3ea11724   140734244132644
r10            0x688    1672
r11            0x7fffc8465f88   140736553443208
r12            0x1      1
r13            0x1      1
r14            0x7fffc8476f70   140736553512816
r15            0x7fff3db85448   140734228878408
rip            0x7fffe979cb77   0x7fffe979cb77 <glsl_to_tgsi_visitor::visit(ir_expression*)+39>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
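
A rough way to read these numbers, as a sketch only: gdb's "Thread 0x7fffc84b1700" is the address of the crashing thread's descriptor, and the assumption here (not confirmed anywhere in this report) is that glibc on x86-64 places that descriptor near the top of the thread's stack mapping. Under that assumption, the distance from the descriptor down to rsp approximates how much stack the thread had used at the faulting probe. The helper name below is made up for illustration.

```cpp
#include <cstdint>

// Byte distance between two addresses; the caller must pass high >= low.
constexpr std::uint64_t byte_distance(std::uint64_t high, std::uint64_t low)
{
    return high - low;
}

// Values copied from the gdb output in this report.
constexpr std::uint64_t kThreadDescriptor = 0x7fffc84b1700; // "Thread 0x7fffc84b1700" (thread 32)
constexpr std::uint64_t kRsp              = 0x7fffc846af88; // rsp from "i registers"

// ASSUMPTION: the descriptor sits near the top of the thread's stack, so this
// difference is only an estimate of the stack in use, not an exact figure.
constexpr std::uint64_t kApproxStackUsed = byte_distance(kThreadDescriptor, kRsp);
```

If the assumption holds, kApproxStackUsed is 288632 bytes, i.e. roughly 282 KiB of stack in use on that thread when the probe faulted. That would point at this one thread's stack being exhausted rather than at memory running out in general.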
Comment 11 Michel Dänzer 2016-04-20 03:29:41 UTC
Reminds me of bug 92850, which has some instructions for further narrowing down which gcc optimization triggers the problem. Maybe -fno-inline-small-functions even works around this one as well?
Comment 12 Christoph Haag 2016-04-20 07:27:15 UTC
Yes, you are right. With CFLAGS and CXXFLAGS both set to "-O2 -fno-inline-small-functions", it does not crash.

Seems like the answers on stackoverflow apply here too and it's the inlined functions that use up all the available stack memory. I don't know exactly how that works, but I guess all the local variables from the inlined functions get added to the stack frame of the function they are inlined into, every time?
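
To put the guess above into numbers, here is a minimal sketch of the budget involved: in a recursive function every nesting level pays for the whole frame, so a bigger frame directly cuts the depth that fits. The function name and the 256 KiB stack size used in the note below are assumptions for illustration, not values from this report.

```cpp
#include <cstddef>

// Number of nested calls that fit in a stack of stack_bytes when each call
// needs frame_bytes. Returns 0 for a zero frame size to avoid dividing by zero.
constexpr std::size_t max_recursion_depth(std::size_t stack_bytes,
                                          std::size_t frame_bytes)
{
    return frame_bytes == 0 ? 0 : stack_bytes / frame_bytes;
}
```

With a hypothetical 256 KiB thread stack, a 512-byte frame allows 512 levels of nesting, while a 4 KiB frame (the size of the "sub $0x1000,%rsp" step seen above) allows only 64. Whether each level here really costs exactly that much is not established by the thread.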
Comment 13 Nicolai Hähnle 2016-04-25 23:40:47 UTC
Created attachment 123256 [details] [review]
Separate non-recursive part out of visit(ir_expression) to reduce stack explosion

Hi Christoph, could you try the attached patch? It fixes a similar crash for me.

It's a bit surprising that you "only" see a 4KB stack frame, because in my case, glsl_to_tgsi_visitor::visit(ir_expression *) had a ~32KB stack frame...
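
For readers without the attachment, the general idea named in the patch title can be sketched like this. This is NOT the actual patch: Expr, visit, combine and make_chain are made-up stand-ins, and the 16 KiB scratch buffer only imitates a function with many large locals. The recursive function keeps a small frame, and the part with the big locals is moved into a separate function that is prevented from being inlined back (using the GCC/Clang noinline attribute).

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for ir_expression: a leaf constant, or a sum of operands.
struct Expr {
    int value = 0;                               // only used when operands is empty
    std::vector<std::unique_ptr<Expr>> operands;
};

// Non-recursive part. The large locals live here, so they occupy stack only
// while this function runs, not once per level of recursion.
__attribute__((noinline))
int combine(const std::vector<int>& operand_results)
{
    std::array<int, 4096> scratch{};             // about 16 KiB of locals
    std::size_t n = 0;
    for (; n < operand_results.size() && n < scratch.size(); ++n)
        scratch[n] = operand_results[n];
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += scratch[i];
    return sum;
}

// Recursive part. Its own frame stays small because the big buffer is not here.
int visit(const Expr& e)
{
    if (e.operands.empty())
        return e.value;
    std::vector<int> results;                    // element storage is on the heap
    results.reserve(e.operands.size());
    for (const auto& op : e.operands)
        results.push_back(visit(*op));
    return combine(results);
}

// Builds a chain of the given depth iteratively: each level is (child + 1),
// with a leaf of value 1 at the bottom, so visit() returns depth + 1.
std::unique_ptr<Expr> make_chain(int depth)
{
    auto node = std::make_unique<Expr>();
    node->value = 1;
    for (int i = 0; i < depth; ++i) {
        auto parent = std::make_unique<Expr>();
        auto one = std::make_unique<Expr>();
        one->value = 1;
        parent->operands.push_back(std::move(node));
        parent->operands.push_back(std::move(one));
        node = std::move(parent);
    }
    return node;
}
```

Calling visit(*make_chain(1000)) recurses about 1000 levels deep, but at most one combine() frame is live at any moment, so the scratch buffer is paid for once rather than once per level.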
Comment 14 Christoph Haag 2016-04-26 10:07:51 UTC
(In reply to Nicolai Hähnle from comment #13)
> Created attachment 123256 [details] [review]
> Separate non-recursive part out of visit(ir_expression) to reduce stack
> explosion
> 
> Hi Christoph, could you try the attached patch? It fixes a similar crash for
> me.

Yes, it helps; there is no crash with the patch.
Comment 15 Nicolai Hähnle 2016-04-29 16:54:00 UTC
commit 98c348d26b28a662d093543ecb7ca839e7883e8e