Bug 108135 - AVX instructions leak outside of CPU feature check and cause SIGILL
Summary: AVX instructions leak outside of CPU feature check and cause SIGILL
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/swr (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Alok Hota
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-02 21:00 UTC by Thiago Macieira
Modified: 2019-09-18 18:24 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Attempt at making the variables local static (4.30 KB, patch)
2018-10-02 22:02 UTC, Thiago Macieira
Details | Splinter Review

Description Thiago Macieira 2018-10-02 21:00:15 UTC
Reported at https://github.com/clearlinux/distribution/issues/210

Symptom: when running certain applications (Xwayland in our case) on an Atom CPU, the application crashes with SIGILL. The backtrace:

#7 _Z41__static_initialization_and_destruction_0ii.constprop.120() - [swrast_dri.so] - lower_x86.cpp:73
#8 _GLOBAL__I_65535_0_gallium_dri_la_target.o.1657199() - [swrast_dri.so]
#9 call_init.part.0() - [ld-linux-x86-64.so.2] - dl-init.c:72
#10 _dl_init() - [ld-linux-x86-64.so.2] - dl-init.c:30
#11 dl_open_worker() - [ld-linux-x86-64.so.2] - dl-open.c:506
#12 _dl_catch_exception() - [libc.so.6] - dl-error-skeleton.c:196
#13 _dl_open() - [ld-linux-x86-64.so.2] - dl-open.c:588
#14 dlopen_doit() - [libdl.so.2] - dlopen.c:66
#15 _dl_catch_exception() - [libc.so.6] - dl-error-skeleton.c:196
#16 _dl_catch_error() - [libc.so.6] - dl-error-skeleton.c:215
#17 _dlerror_run() - [libdl.so.2] - dlerror.c:163
#18 dlopen@@GLIBC_2.2.5() - [libdl.so.2] - dlopen.c:87
#19 glxProbeDriver() - [/usr/bin/Xwayland] - glxdricommon.c:305
#20 __glXDRIscreenProbe() - [/usr/bin/Xwayland] - glxdriswrast.c:437
#21 xorgGlxServerInit() - [/usr/bin/Xwayland] - glxext.c:550
#22 _CallCallbacks() - [/usr/bin/Xwayland] - dixutils.c:737
#23 GlxExtensionInit() - [/usr/bin/Xwayland] - callback.h:83
#24 InitExtensions() - [/usr/bin/Xwayland] - miinitext.c:267
#25 dix_main() - [/usr/bin/Xwayland] - main.c:197
#26 __libc_start_main() - [libc.so.6] - libc-start.c:308
#27 _start() - [/usr/bin/Xwayland] - start.S:120

Investigation shows that the swrast_dri.so plugin contains AVX instructions outside of a CPUID check in the "__static_initialization_and_destruction_0" function. Further investigation shows some of those functions come from src/gallium/drivers/swr/rasterizer/jitter/functionpasses/lower_x86.cpp, while initialising the variable SwrJit::instructionMap (which is a global static non-POD of type std::map - see <https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/gallium/drivers/swr/rasterizer/jitter/functionpasses/lower_x86.cpp#L66>).

This lower_x86.cpp file is compiled with -mavx. This means the compiler is free to generate AVX instructions anywhere in the file without a CPUID check, including in the dynamic initialisation and destruction code that the std::map variable requires.

The simplest solution I can think of, for this particular instruction, is to change it from a global static to a local static. This will solve the problem for this particular initialisation, but I think there are more variables with this problem.
Comment 1 Thiago Macieira 2018-10-02 22:02:30 UTC
Created attachment 141840 [details] [review]
Attempt at making the variables local static
Comment 2 Thiago Macieira 2018-10-03 03:30:06 UTC
The patch does solve the problem for the particular file, but there are more AVX uses in swrast_dri.so. The next issue is the intialisation of the builtin types in src/compiler/glsl_types.cpp, caused by:

#define DECL_TYPE(NAME, ...)                                    \
   const glsl_type glsl_type::_##NAME##_type = glsl_type(__VA_ARGS__, #NAME); \
   const glsl_type *const glsl_type::NAME##_type = &glsl_type::_##NAME##_type;

#define STRUCT_TYPE(NAME)

#include "compiler/builtin_type_macros.h"
Comment 3 Thiago Macieira 2018-10-03 04:19:44 UTC
(In reply to Thiago Macieira from comment #2)
> The patch does solve the problem for the particular file, but there are more
> AVX uses in swrast_dri.so. The next issue is the intialisation of the
> builtin types in src/compiler/glsl_types.cpp, caused by:

I believe that's a red herring. It's caused by our use of -flto in the build, meaning the use of AVX in one place "spills" to surrounding code that is otherwise innocent.

Without -flto, the next instruction to crash happens inside GlobalKnobs::GlobalKnobs(). I see this function (in fact, the entire gen_knobs.cpp file) present in at least two libraries: libmesaswr and libswrAVX. That is, both./src/gallium/drivers/swr/rasterizer/codegen/.libs/libmesaswr_la-gen_knobs.o and ./src/gallium/drivers/swr/rasterizer/codegen/.libs/libswrAVX_la-gen_knobs.o exist and contain this function. And BOTH files have AVX instructions. I can see the -mavx flag in the build.

Was this intended?
Comment 4 Alok Hota 2018-10-05 01:38:51 UTC
Hi Thiago,

Thanks for taking such a deep look into this. We probably shouldn't need to be specifically targeting a specific architecture in the jitter. We may be able to fix this by tweaking the build system.
Comment 5 Thiago Macieira 2018-10-05 02:56:42 UTC
Please note that Clear was affected by this GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87522. HJ is on it. It affects LTO.

However, there's still one load-time initialisation left even without LTO. That's a showstopper for all distributions: they simply can't turn SWR on. If we want SWR to be used, that needs to be solved.

I'm working on a different solution for Clear, but it won't apply to other Linux distributions.
Comment 6 Kenneth Graunke 2018-10-06 06:31:32 UTC
This does point out an even more serious issue though:

Megadrivers interacts very badly with drivers that have global C++ initializers.  When dlopen'ing a DRI driver, C++ global constructors will run for ANY driver that is built in, even if that isn't the driver the user is trying to use.

In other words, let's say you're on a Raspberry Pi and are trying to use the vc4 driver.  If your distro happens to have built SWR, then you'll end up running a bunch of SWR C++ constructors...when trying to dlopen vc4_dri.so...because both live in the same .so file.

Such global initialization really ought to happen at driver screen init time, not dlopen time.  IMO we should forbid C++ global objects in Mesa drivers.  (We have a few in the compiler, but everybody uses the compiler.)  I don't know how feasible that is, though.  If it isn't, maybe we need to exclude such drivers from the megadrivers mechanism...
Comment 7 Thiago Macieira 2018-10-06 17:23:13 UTC
You can scan for .o that have initialisers by searching for .init_array sections
Comment 8 GitLab Migration User 2019-09-18 18:24:33 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/198.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.