Bug 109304 - GfxBench AztecRuins Vulkan version Segfault
Summary: GfxBench AztecRuins Vulkan version Segfault
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-01-11 09:41 UTC by Eero Tamminen
Modified: 2019-01-14 08:35 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments

Description Eero Tamminen 2019-01-11 09:41:41 UTC
Setup:
* Ubuntu 18.04
* Latest Mesa from git
* Latest X or Weston from git
* drm-tip git kernel v4.20 or newer
* GfxBench v5 (X or Wayland build, GOLD2 or GOLD3 version)

Test-case:
* Run AztecRuins (normal or high) Vulkan version:
bin/testfw_app --gfx vulkan --gl_api vulkan --width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal

Result:
* Segfault when run from Gdb (abort when run without as GfxBench catches them and aborts):
--------------------------------------
...
compile light_directional_cube.frag...
done

SPIR-V WARNING:
    In file ../../../../../source2/_repos/git___anongit_freedesktop_org_git_mesa_mesa/src/compiler/spirv/vtn_variables.c:2258
    OpStore of a sampler detected.  Doing on-the-fly copy propagation to workaround the problem.
    9408 bytes into the SPIR-V binary

Thread 2 "testfw_app" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff3339700 (LWP 10251)]
--------------------------------------

This doesn't happen with GL or GLES versions, only Vulkan.  This regression started between following commits:
* 2019-01-07 17:07:33 0cc01f45e7: glsl: correct typo in GLSL compilation error message
* 2019-01-08 17:51:46 d0c6ef2793: nir: rename global/local to private/function memory


Valgrind says following:
--------------------------------------
SPIR-V WARNING:
    In file src/compiler/spirv/vtn_variables.c:2258
    OpStore of a sampler detected.  Doing on-the-fly copy propagation to workaround the problem.
    9408 bytes into the SPIR-V binary
==10457== Invalid read of size 4
==10457==    at 0xA54E458: try_lower_tex_ycbcr (anv_nir_lower_ycbcr_textures.c:328)
==10457==    by 0xA54E458: anv_nir_lower_ycbcr_textures (anv_nir_lower_ycbcr_textures.c:464)
==10457==    by 0xA55372F: anv_pipeline_lower_nir (anv_pipeline.c:500)
==10457==    by 0xA555FEE: anv_pipeline_compile_graphics (anv_pipeline.c:1034)
==10457==    by 0xA5577BB: anv_pipeline_init (anv_pipeline.c:1463)
==10457==    by 0xA676675: gen9_graphics_pipeline_create (genX_pipeline.c:1787)
==10457==    by 0xA676675: gen9_CreateGraphicsPipelines (genX_pipeline.c:2001)
==10457==    by 0x917FCC: ??? (in bin/testfw_app)
==10457==  Address 0x3c is not stack'd, malloc'd or (recently) free'd
==10457== 
==10457== 
==10457== Process terminating with default action of signal 6 (SIGABRT)
==10457==    at 0x7239E97: raise (raise.c:51)
==10457==    by 0x723B800: abort (abort.c:79)
==10457==    by 0x6872B5: ??? (in bin/testfw_app)
==10457==    by 0x623888F: ??? (in /lib/x86_64-linux-gnu/libpthread-2.27.so)
==10457==    by 0xA54E457: try_lower_tex_ycbcr (anv_nir_lower_ycbcr_textures.c:396)
==10457==    by 0xA54E457: anv_nir_lower_ycbcr_textures (anv_nir_lower_ycbcr_textures.c:464)
==10457==    by 0xA55372F: anv_pipeline_lower_nir (anv_pipeline.c:500)
==10457==    by 0xA555FEE: anv_pipeline_compile_graphics (anv_pipeline.c:1034)
==10457==    by 0xA5577BB: anv_pipeline_init (anv_pipeline.c:1463)
==10457==    by 0xA676675: gen9_graphics_pipeline_create (genX_pipeline.c:1787)
==10457==    by 0xA676675: gen9_CreateGraphicsPipelines (genX_pipeline.c:2001)
--------------------------------------
Comment 1 Eero Tamminen 2019-01-11 09:42:03 UTC
Gdb tells:
--------------------------------------
(gdb) bt
#0  try_lower_tex_ycbcr (tex=0x7fffef8e2430, builder=0x7ffff3327b70, layout=0x7fffef8249c0)
    at src/intel/vulkan/anv_nir_lower_ycbcr_textures.c:328
#1  anv_nir_lower_ycbcr_textures (shader=shader@entry=0x7fffee3037e0, 
    layout=layout@entry=0x7fffef8249c0)
    at src/intel/vulkan/anv_nir_lower_ycbcr_textures.c:464
#2  0x00007ffff23d7730 in anv_pipeline_lower_nir (pipeline=pipeline@entry=0x7fffef005620, 
    mem_ctx=mem_ctx@entry=0x7fffef133700, stage=stage@entry=0x7ffff33313b0, 
    layout=layout@entry=0x7fffef8249c0)
    at src/intel/vulkan/anv_pipeline.c:500
#3  0x00007ffff23d9fef in anv_pipeline_compile_graphics (pipeline=pipeline@entry=0x7fffef005620, 
    cache=cache@entry=0x7fffef0d1300, info=info@entry=0x7ffff33371d0)
    at src/intel/vulkan/anv_pipeline.c:1034
#4  0x00007ffff23db7bc in anv_pipeline_init (pipeline=pipeline@entry=0x7fffef005620, 
    device=device@entry=0x7fffec37f460, cache=cache@entry=0x7fffef0d1300, 
    pCreateInfo=pCreateInfo@entry=0x7ffff33371d0, alloc=0x7fffec37f468, alloc@entry=0x0)
    at src/intel/vulkan/anv_pipeline.c:1463
#5  0x00007ffff24fa676 in gen9_graphics_pipeline_create (pPipeline=0x7fffee3562d0, pAllocator=0x0, 
    pCreateInfo=0x7ffff33371d0, cache=0x7fffef0d1300, _device=0x7fffec37f460)
    at src/intel/vulkan/genX_pipeline.c:1787
#6  gen9_CreateGraphicsPipelines (_device=0x7fffec37f460, pipelineCache=0x7fffef0d1300, count=1, 
    pCreateInfos=<optimized out>, pAllocator=0x0, pPipelines=0x7fffee3562d0)
    at src/intel/vulkan/genX_pipeline.c:2001

(gdb) disassemble try_lower_tex_ycbcr
   0x00007ffff23d1549 <+281>:	mov    0x80(%r14),%ecx
   0x00007ffff23d1550 <+288>:	mov    0x78(%r14),%rsi
323:
   0x00007ffff23d1581 <+337>:	lea    0x44f7f8(%rip),%rcx
   0x00007ffff23d1588 <+344>:	lea    0x44f489(%rip),%rsi
   0x00007ffff23d158f <+351>:	lea    0x44f52a(%rip),%rdi
   0x00007ffff23d1596 <+358>:	mov    $0x143,%edx
   0x00007ffff23d159b <+363>:	callq  0x7ffff239bbf0 <__assert_fail@plt>
   0x00007ffff23d15a0 <+368>:	test   %eax,%eax
   0x00007ffff23d15a2 <+370>:	js     0x7ffff23d1581 <anv_nir_lower_ycbcr_textures+337>
   0x00007ffff23d15a4 <+372>:	cltq   
   0x00007ffff23d15a6 <+374>:	shl    $0x6,%rax
   0x00007ffff23d15aa <+378>:	add    %rax,%rsi
324-327:
   0x00007ffff23d162c <+508>:	movslq 0x3c(%rdx),%rax
   0x00007ffff23d163c <+524>:	shl    $0x4,%rax
328:
=> 0x00007ffff23d2458 <+4136>:	mov    0x3c,%eax
   0x00007ffff23d245f <+4143>:	ud2    
   0x00007ffff23d2461 <+4145>:	nopl   0x0(%rax)

(gdb) info registers rdx rax eax
rdx            0x7fffef424aa0	140737207487136
rax            0x5	5
eax            0x5	5

(gdb) up
#1  anv_nir_lower_ycbcr_textures (shader=shader@entry=0x7fffee3037e0, 
    layout=layout@entry=0x7fffef8249c0)
    at src/intel/vulkan/anv_nir_lower_ycbcr_textures.c:464
464	in src/intel/vulkan/anv_nir_lower_ycbcr_textures.c
(gdb) info locals
tex = <optimized out>
instr = 0x7fffef8e2430
__next = 0x7fffeef001c0
block = <optimized out>
function_progress = false
builder = {cursor = {option = nir_cursor_before_block, {block = 0x0, instr = 0x0}}, exact = false, 
  shader = 0x7fffee3037e0, impl = 0x7fffef919690}
function = <optimized out>
progress = <optimized out>
(gdb) print *instr
$2 = {node = {next = 0x7fffeef001c0, prev = 0x7fffeef00090}, block = 0x7fffef92d170, 
  type = nir_instr_type_tex, pass_flags = 1 '\001', index = 0}
(gdb) print *__next
$3 = {node = {next = 0x7fffeef002f0, prev = 0x7fffef8e2430}, block = 0x7fffef92d170, 
  type = nir_instr_type_alu, pass_flags = 1 '\001', index = 0}
--------------------------------------

Source code corresponding to try_lower_tex_ycbcr disassembly around the crash is:
--------------------------------------
322: int deref_src_idx = nir_tex_instr_src_index(tex, nir_tex_src_texture_deref);
323: assert(deref_src_idx >= 0);
324: nir_deref_instr *deref = nir_src_as_deref(tex->src[deref_src_idx].src);

326: nir_variable *var = nir_deref_instr_get_variable(deref);
327: const struct anv_descriptor_set_layout *set_layout =
328:   layout->set[var->data.descriptor_set].layout;
--------------------------------------

To me it seems that deref got invalid address.
Comment 2 Lionel Landwerlin 2019-01-11 11:30:42 UTC
Problem here is that there is a cast between texture instruction and the variable :

	vec1 32 ssa_72 = deref_var &shadow_map (uniform sampler2DShadow) 
	vec1 32 ssa_73 = deref_cast (sampler2DShadow *)ssa_72 (function sampler2DShadow) 
	vec2 32 ssa_74 = vec2 ssa_57, ssa_58
	vec1 32 ssa_75 = tex ssa_73 (texture_deref), ssa_73 (sampler_deref), ssa_74 (coord), ssa_59 (comparator), 


But the nir_deref_instr_get_variable() will stop on a cast and return NULL.
Maybe Jason has more background on what to do here?
Comment 3 Jason Ekstrand 2019-01-11 21:06:23 UTC
I am getting so tired of maintaining this workaround.... In any case, patches are now on the list that work around it some more:

https://patchwork.freedesktop.org/series/55099/
Comment 4 Jason Ekstrand 2019-01-12 23:57:24 UTC
Fixed by the following commit in master:

commit b57c1ec4219f01bfdb98bcab8fca4c44e87bd1a4
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Jan 11 14:17:24 2019 -0600

    spirv: Whack sampler/image pointers to uniform
    
    A long time in a galaxy far far away, there was a GLSLang bug with how
    it handled samplers passed in as function parameters.  (The bug can be
    found here: https://github.com/KhronosGroup/glslang/issues/179.)
    Unfortunately, that version was shipped in several apps and has been
    causing heartburn for our SPIR-V parser ever since.
    
    Recent changes to NIR uncovered a moderately old bug in how we work
    around this issue.  In particular, we ended up with a deref_cast from
    uniform to local which is not a no-op cast so nir_opt_deref wasn't
    getting rid of the cast.  The only reason why it worked before was
    because someone just happened to call nir_fixup_deref_modes which
    "fixed" the cast (that shouldn't be happening) and then a later round of
    copy-prop would get rid of it.  The fact that the deref_cast survived
    that long without causing trouble for other parts of NIR is a bit
    surprising.
    
    Just whacking the mode of the pointer seems to fix it fairly
    unobtrusively.  Currently, only apps with this bug will have a local
    variable containing an image or sampler.
Comment 5 Eero Tamminen 2019-01-14 08:35:18 UTC
Verified, works fine on all Vulkan platforms again.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.