Reported at https://github.com/clearlinux/distribution/issues/210 Symptom: when running certain applications (Xwayland in our case) on an Atom CPU, the application crashes with SIGILL. The backtrace: #7 _Z41__static_initialization_and_destruction_0ii.constprop.120() - [swrast_dri.so] - lower_x86.cpp:73 #8 _GLOBAL__I_65535_0_gallium_dri_la_target.o.1657199() - [swrast_dri.so] #9 call_init.part.0() - [ld-linux-x86-64.so.2] - dl-init.c:72 #10 _dl_init() - [ld-linux-x86-64.so.2] - dl-init.c:30 #11 dl_open_worker() - [ld-linux-x86-64.so.2] - dl-open.c:506 #12 _dl_catch_exception() - [libc.so.6] - dl-error-skeleton.c:196 #13 _dl_open() - [ld-linux-x86-64.so.2] - dl-open.c:588 #14 dlopen_doit() - [libdl.so.2] - dlopen.c:66 #15 _dl_catch_exception() - [libc.so.6] - dl-error-skeleton.c:196 #16 _dl_catch_error() - [libc.so.6] - dl-error-skeleton.c:215 #17 _dlerror_run() - [libdl.so.2] - dlerror.c:163 #18 dlopen@@GLIBC_2.2.5() - [libdl.so.2] - dlopen.c:87 #19 glxProbeDriver() - [/usr/bin/Xwayland] - glxdricommon.c:305 #20 __glXDRIscreenProbe() - [/usr/bin/Xwayland] - glxdriswrast.c:437 #21 xorgGlxServerInit() - [/usr/bin/Xwayland] - glxext.c:550 #22 _CallCallbacks() - [/usr/bin/Xwayland] - dixutils.c:737 #23 GlxExtensionInit() - [/usr/bin/Xwayland] - callback.h:83 #24 InitExtensions() - [/usr/bin/Xwayland] - miinitext.c:267 #25 dix_main() - [/usr/bin/Xwayland] - main.c:197 #26 __libc_start_main() - [libc.so.6] - libc-start.c:308 #27 _start() - [/usr/bin/Xwayland] - start.S:120 Investigation shows that the swrast_dri.so plugin contains AVX instructions outside of a CPUID check in the "__static_initialization_and_destruction_0" function. Further investigation shows some of those functions come from src/gallium/drivers/swr/rasterizer/jitter/functionpasses/lower_x86.cpp, while initialising the variable SwrJit::instructionMap (which is a global static non-POD of type std::map - see <https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/gallium/drivers/swr/rasterizer/jitter/functionpasses/lower_x86.cpp#L66>). This lower_x86.cpp file is compiled with -mavx. This means the compiler is free to generate AVX instructions anywhere in the file without a CPUID check, including in the dynamic initialisation and destruction code that the std::map variable requires. The simplest solution I can think of, for this particular instruction, is to change it from a global static to a local static. This will solve the problem for this particular initialisation, but I think there are more variables with this problem.
Created attachment 141840 [details] [review] Attempt at making the variables local static
The patch does solve the problem for the particular file, but there are more AVX uses in swrast_dri.so. The next issue is the intialisation of the builtin types in src/compiler/glsl_types.cpp, caused by: #define DECL_TYPE(NAME, ...) \ const glsl_type glsl_type::_##NAME##_type = glsl_type(__VA_ARGS__, #NAME); \ const glsl_type *const glsl_type::NAME##_type = &glsl_type::_##NAME##_type; #define STRUCT_TYPE(NAME) #include "compiler/builtin_type_macros.h"
(In reply to Thiago Macieira from comment #2) > The patch does solve the problem for the particular file, but there are more > AVX uses in swrast_dri.so. The next issue is the intialisation of the > builtin types in src/compiler/glsl_types.cpp, caused by: I believe that's a red herring. It's caused by our use of -flto in the build, meaning the use of AVX in one place "spills" to surrounding code that is otherwise innocent. Without -flto, the next instruction to crash happens inside GlobalKnobs::GlobalKnobs(). I see this function (in fact, the entire gen_knobs.cpp file) present in at least two libraries: libmesaswr and libswrAVX. That is, both./src/gallium/drivers/swr/rasterizer/codegen/.libs/libmesaswr_la-gen_knobs.o and ./src/gallium/drivers/swr/rasterizer/codegen/.libs/libswrAVX_la-gen_knobs.o exist and contain this function. And BOTH files have AVX instructions. I can see the -mavx flag in the build. Was this intended?
Hi Thiago, Thanks for taking such a deep look into this. We probably shouldn't need to be specifically targeting a specific architecture in the jitter. We may be able to fix this by tweaking the build system.
Please note that Clear was affected by this GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87522. HJ is on it. It affects LTO. However, there's still one load-time initialisation left even without LTO. That's a showstopper for all distributions: they simply can't turn SWR on. If we want SWR to be used, that needs to be solved. I'm working on a different solution for Clear, but it won't apply to other Linux distributions.
This does point out an even more serious issue though: Megadrivers interacts very badly with drivers that have global C++ initializers. When dlopen'ing a DRI driver, C++ global constructors will run for ANY driver that is built in, even if that isn't the driver the user is trying to use. In other words, let's say you're on a Raspberry Pi and are trying to use the vc4 driver. If your distro happens to have built SWR, then you'll end up running a bunch of SWR C++ constructors...when trying to dlopen vc4_dri.so...because both live in the same .so file. Such global initialization really ought to happen at driver screen init time, not dlopen time. IMO we should forbid C++ global objects in Mesa drivers. (We have a few in the compiler, but everybody uses the compiler.) I don't know how feasible that is, though. If it isn't, maybe we need to exclude such drivers from the megadrivers mechanism...
You can scan for .o that have initialisers by searching for .init_array sections
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/198.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.