Bug 105497 - shader-db crashes on 72 core system after ast_type_qualifier bitset change
Summary: shader-db crashes on 72 core system after ast_type_qualifier bitset change
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: glsl-compiler (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Francisco Jerez
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: bisected, regression
Depends on:
Blocks:
 
Reported: 2018-03-14 06:19 UTC by Kenneth Graunke
Modified: 2018-05-09 11:53 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments

Description Kenneth Graunke 2018-03-14 06:19:33 UTC
shader-db crashes on our 72-core server (ogopogo):

Thread 38 "run" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdf3f8700 (LWP 16090)]
ast_type_qualifier::validate_in_qualifier (this=this@entry=0x7fffdf3ec728, 
    loc=loc@entry=0x7fffdf3eb3b4, state=state@entry=0x7ffed491fde0)
    at glsl/ast_type.cpp:653
653	   if ((this->flags.i & ~valid_in_mask.flags.i) != 0) {
(gdb) bt
#0  ast_type_qualifier::validate_in_qualifier (this=this@entry=0x7fffdf3ec728, 
    loc=loc@entry=0x7fffdf3eb3b4, state=state@entry=0x7ffed491fde0)
    at glsl/ast_type.cpp:653
#1  0x00007ffff416b901 in _mesa_glsl_parse (state=state@entry=0x7ffed491fde0)
    at ./glsl/glsl_parser.yy:2866
#2  0x00007ffff41758d6 in _mesa_glsl_compile_shader (
    ctx=ctx@entry=0x7ffed4047280, shader=shader@entry=0x7ffed40b7fa0, 
    dump_ast=dump_ast@entry=false, dump_hir=dump_hir@entry=false, 
    force_recompile=force_recompile@entry=false)
    at glsl/glsl_parser_extras.cpp:2089
#3  0x00007ffff3f6845f in _mesa_compile_shader (sh=0x7ffed40b7fa0, 
    ctx=0x7ffed4047280) at main/shaderapi.c:1129
#4  _mesa_CreateShaderProgramv (type=<optimised out>, count=<optimised out>, 
    strings=0x7fffdf3f6e40) at main/shaderapi.c:2510
#5  0x0000000000403521 in main._omp_fn.0 () at run.c:814
#6  0x00007ffff74be43e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#7  0x00007ffff72996ba in start_thread (arg=0x7fffdf3f8700)
    at pthread_create.c:333
#8  0x00007ffff6fcf41d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) 

It seems to be some kind of concurrency issue caused by

commit ba79a90fb52e1e81fbfb38113e85a56b13497c50
Author: Francisco Jerez <currojerez@riseup.net>
Date:   Mon Feb 12 14:18:15 2018 -0800

    glsl: Switch ast_type_qualifier to a 128-bit bitset.

At that commit, plus the compile fix (e119c53f56064d8acecfcd22a15ba512c83dfcd0) cherry-picked on top, I get the crashes.  But commit bdbc2ffa4219b39e47a27decbc603d445286d92d (the one before that) works fine.
Comment 1 Francisco Jerez 2018-03-14 07:10:27 UTC
Somehow not hugely surprised, the line where the crash occurs is arguably relying on undefined behavior, because it's accessing an inactive union member of potentially both this->flags and valid_in_mask.flags.  This was however already UB before my patch -- Funny thing about UB is things may seem to work fine until a harmless-looking change lands that just happens to change the behavior of some compiler versions, inexplicably leading to data corruption.

What compiler was used to build mesa?  Where exactly did it crash?  Can you show me the x86 assembly and register dumps?  Why do you think it is a concurrency issue?  Aren't the data structures involved single-threaded at this point of the program?
Comment 2 Francisco Jerez 2018-03-15 23:44:11 UTC
Thanks for letting me reproduce this on ogopogo.  It seems to be crashing at this instruction:

   0x00007ffff41106d9 <+169>:   movaps %xmm0,-0x130(%rbp)
=> 0x00007ffff41106e0 <+176>:   pand   (%rbx),%xmm0
   0x00007ffff41106e4 <+180>:   movaps %xmm0,-0x120(%rbp)

GCC 5.4 seems to be vectorizing the 128-bit bitmask operation, but the memory address of "this" is not 128-bit aligned (it's 0x7fffe8bff728 during this run), which causes the crash.  Probably a GCC bug...
Comment 3 Kenneth Graunke 2018-03-17 23:22:30 UTC
I think you're right - GCC shouldn't be doing this.  GCC 7.3.0 with -O3 seems to work fine.  GCC 5.4 with -O2 seems to work fine.  So it looks like a bug with GCC 5.4 and -O3 incorrectly vectorizing things.

Closing as NOTOURBUG.
Comment 4 Brian Paul 2018-05-08 14:42:36 UTC
I'm posting a patch to work-around this issue.  gcc 5.4 is pretty common and this bug could be encountered by a lot of people.

For the record, the bug can be repro'd with the Piglit tests/spec/glsl-1.50/execution/varying-struct-basic-gs-fs.shader_test test when Mesa's compiled with -O3.
Comment 5 Brian Paul 2018-05-09 11:53:50 UTC
Fixed with 901db25d5b7cd2ac2dd648b370c4bddf23dd5c44


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.