Bug 101832

Summary: [PATCH][regression][bisect] Xorg fails to start after f50aa21456d82c8cb6fbaa565835f1acc1720a5d
Product: Mesa
Reporter: Laurent carlier <lordheavym>
Component: Drivers/Gallium/swr
Assignee: mesa-dev
Status: RESOLVED FIXED
QA Contact: mesa-dev
Severity: blocker
Priority: medium
CC: andyrtr, bero, nick.tenney, timothy.o.rowley
Version: 17.2
Hardware: All
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
  xorg log file with the segfault
  Workaround
  debug output from knob initialization
  Fix

Description Laurent carlier 2017-07-18 23:06:29 UTC
Created attachment 132751 [details]
xorg log file with the segfault

* llvm-svn 308274
* AMD Radeon RX 470 Graphics (AMD POLARIS10 / DRM 3.15.0 / 4.12.2-1-ARCH, LLVM 5.0.0)

Bisecting gives:
f50aa21456d82c8cb6fbaa565835f1acc1720a5d is the first bad commit
commit f50aa21456d82c8cb6fbaa565835f1acc1720a5d
Author: Tim Rowley <timothy.o.rowley@intel.com>
Date:   Thu Jun 29 14:37:07 2017 -0500

    swr: build driver proper separate from rasterizer

bisect log:
git bisect start
# good: [6c7b7aa3d8323a7cde5ab2b84fabc16913adeab4] a5xx: fix condition for updating *_FS_OUTPUT_CNTL
git bisect good 6c7b7aa3d8323a7cde5ab2b84fabc16913adeab4
# bad: [28ccf8587e1e1c0e9a7b08296807c343f33dc9de] i965/gen4: Set tile offsets to zero after depth rebase
git bisect bad 28ccf8587e1e1c0e9a7b08296807c343f33dc9de
# bad: [2b895475f600b142e9ccbfb3b33009fe68b21162] util: Remove u_math from u_vector
git bisect bad 2b895475f600b142e9ccbfb3b33009fe68b21162
# bad: [aadd37298c704982036f64e58903c80cd7dac93b] i965/miptree: Add a return for updating of winsys
git bisect bad aadd37298c704982036f64e58903c80cd7dac93b
# bad: [4d8191fd000071328b97bde5fc31ab1c39238d27] egl: check for extensions' presence during attr parsing
git bisect bad 4d8191fd000071328b97bde5fc31ab1c39238d27
# good: [618be8cc1ad1760103930b69ffbf528d7b861ab3] i965: Resolve framebuffers before signaling the fence
git bisect good 618be8cc1ad1760103930b69ffbf528d7b861ab3
# bad: [314879f7fec07cedb5263681173a22d522a8ac9a] i965: Fix asynchronous mappings on !LLC platforms.
git bisect bad 314879f7fec07cedb5263681173a22d522a8ac9a
# good: [27c5568de3674ec95f02816a06b13180bad0838b] swr/rast: make SWR_VISIBLE attribute work for windows
git bisect good 27c5568de3674ec95f02816a06b13180bad0838b
# bad: [f50aa21456d82c8cb6fbaa565835f1acc1720a5d] swr: build driver proper separate from rasterizer
git bisect bad f50aa21456d82c8cb6fbaa565835f1acc1720a5d
# good: [50cd222116b40e4df2462cb25a92960d557c9144] swr: switch to using SwrGetInterface api table
git bisect good 50cd222116b40e4df2462cb25a92960d557c9144
# first bad commit: [f50aa21456d82c8cb6fbaa565835f1acc1720a5d] swr: build driver proper separate from rasterizer
Comment 1 Laurent carlier 2017-07-19 10:47:59 UTC
mesa is built with:

  ./autogen.sh --prefix=/usr \
  --sysconfdir=/etc \
  --with-dri-driverdir=/usr/lib/xorg/modules/dri \
  --with-gallium-drivers=r300,r600,radeonsi,nouveau,svga,swrast,virgl,swr \
  --with-dri-drivers=i915,i965,r200,radeon,nouveau,swrast \
  --with-platforms=x11,drm,wayland \
  --with-vulkan-drivers=intel,radeon \
  --disable-xvmc \
  --enable-llvm \
  --enable-llvm-shared-libs \
  --enable-shared-glapi \
  --enable-libglvnd \
  --enable-libunwind \
  --enable-lmsensors \
  --enable-egl \
  --enable-glx \
  --enable-glx-tls \
  --enable-gles1 \
  --enable-gles2 \
  --enable-gbm \
  --enable-dri \
  --enable-gallium-osmesa \
  --enable-gallium-extra-hud \
  --enable-texture-float \
  --enable-xa \
  --enable-vdpau \
  --enable-omx \
  --enable-nine \
  --enable-opencl \
  --enable-opencl-icd \
  --with-clang-libdir=/usr/lib
Comment 2 Laurent carlier 2017-07-19 13:22:59 UTC
building without swr fixes the problem
Comment 3 Emil Velikov 2017-07-19 14:06:06 UTC
Seems like some binary is having unresolved symbols - unw_get_proc_name at least.

AFAICT it cannot happen for the DRI module, and since you're not using SWR none of its backends should be attempted, let alone loaded.

Please check all the binaries for "undefined symbol" via: $ ldd -r $binary

Thanks!
Comment 4 Laurent carlier 2017-07-19 14:14:25 UTC
got this:
[lordh@lordh-pc lib]$ ldd -r libswrAVX2.so.0.0.0
        linux-vdso.so.1 (0x00007ffefe1b9000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8206c77000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f8206965000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f82065bf000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x000055844c2fe000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f82063a8000)
undefined symbol: pthread_create        (./libswrAVX2.so.0.0.0)
undefined symbol: pthread_setaffinity_np        (./libswrAVX2.so.0.0.0)
undefined symbol: pthread_setname_np    (./libswrAVX2.so.0.0.0)

[lordh@lordh-pc lib]$ ldd -r libswrAVX.so.0.0.0 
        linux-vdso.so.1 (0x00007ffdd86e7000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f6e74166000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f6e73e54000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f6e73aae000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x0000557dc176b000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f6e73897000)
undefined symbol: pthread_create        (./libswrAVX.so.0.0.0)
undefined symbol: pthread_setaffinity_np        (./libswrAVX.so.0.0.0)
undefined symbol: pthread_setname_np    (./libswrAVX.so.0.0.0)
Comment 5 Emil Velikov 2017-07-21 13:00:32 UTC
Right, so I may have misread the Xorg.log, but at least the AVX binaries will not have unresolved symbols, props to https://patchwork.freedesktop.org/patch/168170/

Possible symbol collision comes to mind, but I'm not working on either SWR or radeonsi :-\

Tim, can you please have a look?
Comment 6 Laurent carlier 2017-07-24 15:30:40 UTC
The patchset fixes the unresolved symbols, but the segfault is still there. I will try to grab a better backtrace.
Comment 7 Laurent carlier 2017-07-24 15:54:33 UTC
With debug symbols, backtrace is a bit different:

[442527.173] (EE) Backtrace:
[442527.174] (EE) 0: /usr/lib/xorg-server/Xorg (OsSigHandler+0x2a) [0x5645b1d61fba]
[442527.174] (EE) 1: /usr/lib/libpthread.so.0 (funlockfile+0x50) [0x7fe7639f982f]
[442527.174] (EE) 2: /usr/lib/libc.so.6 (strlen+0x26) [0x7fe7636c48c6]
[442527.175] (EE) 3: /usr/lib/xorg/modules/dri/radeonsi_dri.so (_ZN8KnobBase30autoExpandEnvironmentVariablesERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x120) [0x7fe75eb34630]
[442527.175] (EE) 4: /usr/lib/xorg/modules/dri/radeonsi_dri.so (_ZN11GlobalKnobsC1Ev+0x13b) [0x7fe75eb34fbb]
[442527.175] (EE) 5: /usr/lib/xorg/modules/dri/radeonsi_dri.so (_GLOBAL__sub_I_gen_knobs.cpp+0x10) [0x7fe75e3f25f0]
[442527.176] (EE) 6: /lib64/ld-linux-x86-64.so.2 (call_init.part.0+0x9a) [0x7fe765c5237a]
[442527.176] (EE) 7: /lib64/ld-linux-x86-64.so.2 (_dl_init+0x76) [0x7fe765c52486]
[442527.176] (EE) 8: /lib64/ld-linux-x86-64.so.2 (dl_open_worker+0x38e) [0x7fe765c5693e]
[442527.177] (EE) 9: /usr/lib/libc.so.6 (_dl_catch_error+0x84) [0x7fe763769e44]
[442527.177] (EE) 10: /lib64/ld-linux-x86-64.so.2 (_dl_open+0xca) [0x7fe765c5615a]
[442527.177] (EE) unw_get_proc_name failed: no unwind info found [-10]
[442527.177] (EE) 11: /usr/lib/libdl.so.2 (?+0xca) [0x7fe7652c3f1a]
[442527.177] (EE) 12: /usr/lib/libc.so.6 (_dl_catch_error+0x84) [0x7fe763769e44]
[442527.178] (EE) 13: /usr/lib/libdl.so.2 (dlerror+0x2e7) [0x7fe7652c4827]
[442527.178] (EE) 14: /usr/lib/libdl.so.2 (dlopen+0x42) [0x7fe7652c3f42]
[442527.178] (EE) 15: /usr/lib/libgbm.so.1 (dri_open_driver.isra.5+0x1b4) [0x7fe75fef8984]
[442527.178] (EE) 16: /usr/lib/libgbm.so.1 (dri_screen_create_dri2+0x2c) [0x7fe75fef8aac]
[442527.178] (EE) 17: /usr/lib/libgbm.so.1 (dri_device_create+0x168) [0x7fe75fef8f28]
[442527.178] (EE) 18: /usr/lib/libgbm.so.1 (gbm_create_device+0x57) [0x7fe75fef6e07]
[442527.178] (EE) 19: /usr/lib/xorg/modules/drivers/amdgpu_drv.so (_init+0x7ffd) [0x7fe7603218dd]
[442527.179] (EE) 20: /usr/lib/xorg-server/Xorg (InitOutput+0xb10) [0x5645b1c3edc0]
[442527.179] (EE) 21: /usr/lib/xorg-server/Xorg (dix_main+0x1e2) [0x5645b1bfbb92]
[442527.179] (EE) 22: /usr/lib/libc.so.6 (__libc_start_main+0xea) [0x7fe7636624ca]
[442527.179] (EE) 23: /usr/lib/xorg-server/Xorg (_start+0x2a) [0x5645b1be553a]
[442527.179] (EE)
[442527.179] (EE) Segmentation fault at address 0x0
[442527.179] (EE)
Fatal server error:
[442527.179] (EE) Caught signal 11 (Segmentation fault). Server aborting
[442527.179] (EE)
[442527.179] (EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
Comment 8 Bernhard Rosenkraenzer 2017-08-16 10:59:55 UTC
This is unrelated to the radeonsi driver -- the exact same commit causes a similar crash here on a laptop with an Intel and Nouveau GPU.

(EE) Backtrace:
(EE) 0: /usr/libexec/Xorg (xorg_backtrace+0x33) [0x56095d]
(EE) 1: /usr/libexec/Xorg (0x400000+0x164414) [0x564414]
(EE) 2: /lib64/libpthread.so.0 (0x3547e00000+0xfb70) [0x3547e0fb70]
(EE) 3: /lib64/libc.so.6 (0x3547a00000+0x113131) [0x3547b13131]
(EE) 4: /usr/lib64/dri/nouveau_dri.so (0x7f8127608000+0xa74c4e) [0x7f812807cc4e]
(EE) 5: /usr/lib64/dri/nouveau_dri.so (0x7f8127608000+0xa752e1) [0x7f812807d2e1]
(EE) 6: /usr/lib64/dri/nouveau_dri.so (0x7f8127608000+0x77ff0) [0x7f812767fff0]
(EE) 7: /lib64/ld-linux-x86-64.so.2 (0x3547600000+0xc0c9) [0x354760c0c9]
(EE) 8: /lib64/ld-linux-x86-64.so.2 (0x3547600000+0xc1d0) [0x354760c1d0]
(EE) 9: /lib64/ld-linux-x86-64.so.2 (0x3547600000+0xf6dc) [0x354760f6dc]
(EE) 10: /lib64/libc.so.6 (_dl_catch_error+0x72) [0x3547aee17b]
(EE) 11: /lib64/ld-linux-x86-64.so.2 (0x3547600000+0xedc1) [0x354760edc1]
(EE) 12: /lib64/libdl.so.2 (0x3548200000+0x1006) [0x3548201006]
(EE) 13: /lib64/libc.so.6 (_dl_catch_error+0x72) [0x3547aee17b]
(EE) 14: /lib64/libdl.so.2 (0x3548200000+0x1505) [0x3548201505]
(EE) 15: /lib64/libdl.so.2 (dlopen+0x35) [0x3548201043]
(EE) 16: /usr/lib64/libgbm.so.1 (0x7f81285af000+0x4ab4) [0x7f81285b3ab4]
(EE) 17: /usr/lib64/libgbm.so.1 (0x7f81285af000+0x4bf4) [0x7f81285b3bf4]
(EE) 18: /usr/lib64/libgbm.so.1 (0x7f81285af000+0x52a8) [0x7f81285b42a8]
(EE) 19: /usr/lib64/libgbm.so.1 (0x7f81285af000+0x2e1b) [0x7f81285b1e1b]
(EE) 20: /usr/lib64/libgbm.so.1 (gbm_create_device+0x39) [0x7f81285b1e89]
(EE) 21: /usr/lib64/xorg/modules/libglamoregl.so (glamor_egl_init+0x80) [0x7f8128630ca0]
(EE) 22: /usr/lib64/xorg/modules/drivers/modesetting_drv.so (0x7f812a2ca000+0x7d72) [0x7f812a2d1d72]
(EE) 23: /usr/libexec/Xorg (InitOutput+0x1660) [0x46be93]
(EE) 24: /usr/libexec/Xorg (0x400000+0x20de1) [0x420de1]
(EE) 25: /lib64/libc.so.6 (__libc_start_main+0x15a) [0x3547a21de3]
(EE) 26: /usr/libexec/Xorg (_start+0x2a) [0x420aba]
(EE)
(EE) Segmentation fault at address 0x0
Comment 9 Bernhard Rosenkraenzer 2017-08-16 12:37:13 UTC
Created attachment 133549 [details] [review]
Workaround

This "fixes" it (forward-port of reverting the commit causing the problem, applies cleanly on 17.2.0-rc4) -- but obviously it isn't a perfect fix because it brings back the problems the original commit was meant to solve.

Certainly better than X crashing on startup though ;)
Comment 10 Tim Rowley 2017-08-16 18:03:28 UTC
Created attachment 133557 [details] [review]
debug output from knob initialization

Looking at Laurent's backtrace, it appears to be a problem with the initialization of the swr knobs structure (global C++ object constructor).  Not sure why that ends up crashing.  I'm not set up for running an X server with a built dri driverset; if you could try running with the following patch, the messages might help point to what's happening.  Thanks.
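
For readers following the backtrace: a minimal sketch of why a global C++ object can crash a process at load time. The names here (DemoKnobs, InitLog, KnobsConstructedBeforeUse) are hypothetical and are not the real SWR knobs code; the sketch only shows that a namespace-scope object's constructor runs during static initialization, which for a shared object means inside dlopen().

```cpp
#include <string>
#include <vector>

// Hypothetical log of initialization events. A function-local static is used
// so the log exists no matter when the first constructor touches it.
static std::vector<std::string>& InitLog() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for a global settings object (not the SWR class).
struct DemoKnobs {
    DemoKnobs() {
        // Runs during static initialization: before main() in an executable,
        // or inside dlopen() when this code is part of a shared object.
        // A crash here takes down the whole loading process.
        InitLog().push_back("DemoKnobs constructed");
    }
};

// Namespace-scope object. The compiler emits an initializer for it, which is
// the kind of frame that appears as _GLOBAL__sub_I_... in a backtrace.
static DemoKnobs g_demoKnobs;

// Reports whether the constructor above has already run.
bool KnobsConstructedBeforeUse() {
    return !InitLog().empty();
}
```

This would explain why drivers other than swr are affected: if the knobs object is linked into every DRI driver binary, its constructor runs whenever any of them is loaded.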
Comment 11 Bernhard Rosenkraenzer 2017-08-30 21:24:07 UTC
The output of the patch doesn't look too helpful to me:

(II) glamor: OpenGL accelerated X.org driver based.
SWR_DEBUG env /tmp/Rast/DebugOutput
SWR_DEBUG env ${HOME}/.swr/jitcache
(EE)
(EE) Backtrace:
(EE) 0: /usr/libexec/Xorg (xorg_backtrace+0x33) [0x56095d]
(EE) 1: /usr/libexec/Xorg (0x400000+0x164414) [0x564414]
(EE) 2: /lib64/libpthread.so.0 (0x3210e00000+0xfb70) [0x3210e0fb70]
(EE) 3: /lib64/libc.so.6 (0x3210a00000+0x113341) [0x3210b13341]
(EE) 4: /usr/lib64/dri/nouveau_dri.so (0x7fa196120000+0xade2d8) [0x7fa196bfe2d8]
(EE) 5: /usr/lib64/dri/nouveau_dri.so (0x7fa196120000+0xadec23) [0x7fa196bfec23]
(EE) 6: /usr/lib64/dri/nouveau_dri.so (0x7fa196120000+0x74086) [0x7fa196194086]
(EE) 7: /lib64/ld-linux-x86-64.so.2 (0x3210600000+0xc0c2) [0x321060c0c2]
(EE) 8: /lib64/ld-linux-x86-64.so.2 (0x3210600000+0xc1c9) [0x321060c1c9]
(EE) 9: /lib64/ld-linux-x86-64.so.2 (0x3210600000+0xf6d5) [0x321060f6d5]
(EE) 10: /lib64/libc.so.6 (_dl_catch_error+0x72) [0x3210aee38b]
(EE) 11: /lib64/ld-linux-x86-64.so.2 (0x3210600000+0xedba) [0x321060edba]
(EE) 12: /lib64/libdl.so.2 (0x3211200000+0x1006) [0x3211201006]
(EE) 13: /lib64/libc.so.6 (_dl_catch_error+0x72) [0x3210aee38b]
(EE) 14: /lib64/libdl.so.2 (0x3211200000+0x1505) [0x3211201505]
(EE) 15: /lib64/libdl.so.2 (dlopen+0x35) [0x3211201043]
(EE) 16: /usr/lib64/libgbm.so.1 (0x371a000000+0x4a34) [0x371a004a34]
(EE) 17: /usr/lib64/libgbm.so.1 (0x371a000000+0x5197) [0x371a005197]
(EE) 18: /usr/lib64/libgbm.so.1 (0x371a000000+0x2dab) [0x371a002dab]
(EE) 19: /usr/lib64/libgbm.so.1 (gbm_create_device+0x44) [0x371a002e14]
(EE) 20: /usr/lib64/xorg/modules/libglamoregl.so (glamor_egl_init+0x80) [0x7fa1989efca0]
(EE) 21: /usr/lib64/xorg/modules/drivers/modesetting_drv.so (0x7fa198a2d000+0x7d72) [0x7fa198a34d72]
(EE) 22: /usr/libexec/Xorg (InitOutput+0x1660) [0x46be93]
(EE) 23: /usr/libexec/Xorg (0x400000+0x20de1) [0x420de1]
(EE) 24: /lib64/libc.so.6 (__libc_start_main+0x15a) [0x3210a21de3]
(EE) 25: /usr/libexec/Xorg (_start+0x2a) [0x420aba]
(EE)
(EE) Segmentation fault at address 0x0
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting


This is with 17.2-rc5. Still works correctly with my workaround patch applied.
Comment 12 Bernhard Rosenkraenzer 2017-08-30 23:12:47 UTC
The debug output was more useful than I initially thought ;)

The crash happens when getenv() in GetEnv() returns NULL, leading to the std::string constructor being passed a NULL pointer.
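
A minimal sketch of the guard this implies, assuming a helper shaped roughly like GetEnv(). SafeGetEnv is a hypothetical name and this is not the code from the attached patch. Constructing a std::string from a null const char* is undefined behavior, which matches the strlen frame at the top of the backtrace.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not the actual Mesa code: wraps std::getenv so that an
// unset variable yields an empty string instead of a null pointer.
std::string SafeGetEnv(const std::string& name) {
    const char* value = std::getenv(name.c_str());
    // std::string(const char*) requires a non-null pointer, so check first.
    return value ? std::string(value) : std::string();
}
```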
Comment 13 Bernhard Rosenkraenzer 2017-08-30 23:16:49 UTC
Created attachment 133894 [details] [review]
Fix

Here's a fix... Probably should (in addition to this) catch the cache directory pointing somewhere invalid though...

Obviously sddm, kdm, lightdm, gdm and friends won't have $HOME set when starting X...
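
To illustrate the case above, here is a rough sketch of expanding a default such as ${HOME}/.swr/jitcache when the variable may be unset. ExpandEnvVars is a hypothetical function written for this example; it is not the Mesa implementation of autoExpandEnvironmentVariables and not the attached fix.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch: expands ${NAME} references in a string, substituting
// an empty string when NAME is not set in the environment.
std::string ExpandEnvVars(const std::string& input) {
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t start = input.find("${", pos);
        if (start == std::string::npos) {
            // No more references: copy the remainder.
            out.append(input, pos, std::string::npos);
            break;
        }
        const std::size_t end = input.find('}', start + 2);
        if (end == std::string::npos) {
            // Unterminated reference: copy the remainder verbatim.
            out.append(input, pos, std::string::npos);
            break;
        }
        // Copy the literal text before the reference.
        out.append(input, pos, start - pos);
        const std::string name = input.substr(start + 2, end - start - 2);
        const char* value = std::getenv(name.c_str());
        if (value != nullptr) {
            out += value;  // Only dereference when the variable exists.
        }
        pos = end + 1;
    }
    return out;
}
```

With the variable unset, this sketch turns ${HOME}/.swr/jitcache into /.swr/jitcache, which no longer crashes but is unlikely to be a usable location. That is the reason for also checking that the resulting cache directory is valid.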
Comment 14 Emil Velikov 2017-09-08 12:31:00 UTC
(In reply to Bernhard Rosenkraenzer from comment #13)
> Created attachment 133894 [details] [review]
> Fix
> 
> Here's a fix... Probably should (in addition to this) catch the cache
> directory pointing somewhere invalid though...
> 
> Obviously sddm, kdm, lightdm, gdm and friends won't have $HOME set when
> starting X...
Bernhard, please send git patches to the list [1]. Do include the following two lines in the commit message.


CC: Tim Rowley <timothy.o.rowley@intel.com>
Fixes: a25093de718 ("swr/rast: Implement JIT shader caching to disk")

[1] https://www.mesa3d.org/submittingpatches.html
Comment 15 Laurent carlier 2017-10-02 17:07:15 UTC
Fixed with commit 21e271024d8e050b75361c2da2e5783100f2e87b
