Created attachment 118566 [details]
gdb backtrace

I use Arch Linux 64-bit and FlightGear 3.4 (CPU: Intel Pentium dual core 3.3 GHz, 4 GB RAM, graphics card: Radeon HD4650 PCIe). FlightGear worked well 2 months ago, but since a recent upgrade of some system packages (mesa 11.0.2-1, kernel 4.2.1, xorg-server 1.17.2-4, gcc-multilib 5.2.0-2) it no longer works. I think mesa 11 is the culprit.

FlightGear crashes a few seconds after the splash screen is displayed. The error message in the console is:

Illegal instruction (core dumped)

I have a Radeon HD4650 PCIe with the open source driver (radeon); maybe the bug doesn't occur with an NVIDIA graphics card.

glxinfo:

OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD RV730 (DRM 2.43.0, LLVM 3.7.0)
OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.0.2
OpenGL core profile shading language version string: 3.30

I tried to debug with gdb and got this backtrace:

(gdb) thread apply all bt full

Thread 13 (Thread 0x7fffeb18c700 (LWP 18427)):
#0 0x00007ffff770907f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1 0x0000000001176bfb in LogStreamPrivate::run() ()
No symbol table info available.
#2 0x000000000125c1ea in SGThread::PrivateData::start_routine(void*) ()
No symbol table info available.
#3 0x00007ffff77034a4 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#4 0x00007ffff294a13d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 12 (Thread 0x7fffe4151700 (LWP 18426)):
#0 0x00007ffff294118d in poll () from /usr/lib/libc.so.6
No symbol table info available.
#1 0x00007fffdc1425e6 in ?? () from /usr/lib/libasound.so.2
No symbol table info available.
#2 0x00007ffff7bae55a in ?? () from /usr/lib/libopenal.so.1
No symbol table info available.
#3 0x00007ffff7bb70d7 in ?? () from /usr/lib/libopenal.so.1
No symbol table info available.
#4 0x00007ffff77034a4 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#5 0x00007ffff294a13d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 10 (Thread 0x7fffdffff700 (LWP 18400)):
#0 0x00007ffff770907f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1 0x000000000117fdcb in (anonymous namespace)::Resolver::run() ()
No symbol table info available.
#2 0x000000000125c1ea in SGThread::PrivateData::start_routine(void*) ()
No symbol table info available.
#3 0x00007ffff77034a4 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#4 0x00007ffff294a13d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x7fffe5d4f700 (LWP 18378)):
#0 0x00007ffff7fe45eb in ?? ()
No symbol table info available.
#1 0x0000000001cb2490 in ?? ()
No symbol table info available.
#2 0x00007fffe5d4d740 in ?? ()
No symbol table info available.
#3 0x0000000000000007 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 6 (Thread 0x7fffe6755700 (LWP 18372)):
#0 0x00007ffff770907f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1 0x00007fffea16e04a in ?? () from /usr/lib/xorg/modules/dri/r600_dri.so
No symbol table info available.
#2 0x00007fffea16d787 in ?? () from /usr/lib/xorg/modules/dri/r600_dri.so
No symbol table info available.
#3 0x00007ffff77034a4 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#4 0x00007ffff294a13d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7ffff7ee5800 (LWP 18362)):
#0 0x00007ffff770907f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1 0x00007ffff5bdc1ee in OpenThreads::Condition::wait(OpenThreads::Mutex*) () from /usr/lib/libOpenThreads.so.20
No symbol table info available.
#2 0x00007ffff665a898 in osgViewer::ViewerBase::renderingTraversals() () from /usr/lib/libosgViewer.so.100
No symbol table info available.
#3 0x0000000000d64edb in fgOSMainLoop () at /home/cesar/compilation/pkg/flightgear3.4/src/flightgear-3.4.0/src/Viewer/fg_os_osgviewer.cxx:335
No locals.
#4 0x0000000000725fef in fgMainInit (argc=16, argv=<optimized out>) at /home/cesar/compilation/pkg/flightgear3.4/src/flightgear-3.4.0/src/Main/main.cxx:519
version = "3.4.0"
col = <optimized out>
configResult = <optimized out>
showLauncher = <optimized out>
result = <optimized out>
#5 0x00000000006d48aa in main (argc=16, argv=0x7fffffffe3a8) at /home/cesar/compilation/pkg/flightgear3.4/src/flightgear-3.4.0/src/Main/bootstrap.cxx:234
_hostname = "ultima-dbr\000\000\000\000\000\000[\000\000\000n", '\000' <repeats 19 times>, "w\000\000\000|\000\000\000pF\270\363\377\177\000\000В\217\001\000\000\000\000\200qG\001\000\000\000\000\300qG\001\000\000\000\000\003\000\000\000\000\000\000\000Xwh\363\377\177\000\000\003\000\000\000\000\000\000\000\000{\207\063\331a+\267\020\004\000\000\000\000\000\000@qG\001\000\000\000\000yF\270\363\377\177\000\000\200qG\001\000\000\000\000\300qG\001\000\000\000\000\201\062h\363\377\177\000\000В\217\001\000\000\000\000\000{\207\063\331a+\267\200qG\001\000\000\000\000\030\002\000\000\000\000\000\000\032\002\000\000\000\000\000\000\031\002"...
fgviewer = false
Created attachment 118567 [details] glxinfo the output of glxinfo
I notice that "export LIBGL_ALWAYS_SOFTWARE=1" (OpenGL software emulation) always triggers a crash in every OpenGL program when I use it.

Why doesn't "export LIBGL_ALWAYS_SOFTWARE=1" work with mesa 11?
Created attachment 118571 [details] apitrace when the bug occurs here is an apitrace when the bug occurs
(In reply to Barto from comment #2)
> I notice that "export LIBGL_ALWAYS_SOFTWARE=1" (OpenGL software emulation)
> always triggers a crash in every OpenGL program when I use it.
>
> Why doesn't "export LIBGL_ALWAYS_SOFTWARE=1" work with mesa 11?

Works like a charm here ;-) I always double-check that things aren't stuffed before making a release.

Can you rebuild the package with debug info [1] and provide another backtrace?

-Emil

P.S. Not a radeon dev, but a fellow archer :)

[1] https://wiki.archlinux.org/index.php/Debug_-_Getting_Traces#PKGBUILD
(In reply to Emil Velikov from comment #4)
> Works like a charm here ;-) I always double-check that things aren't stuffed
> before making a release.
>
> Can you rebuild the package with debug info [1] and provide another
> backtrace ?

This is strange, because each time I use "export LIBGL_ALWAYS_SOFTWARE=1" I get a crash in OpenGL applications, for example with "glxgears":

$ export LIBGL_ALWAYS_SOFTWARE=1
$ glxgears
Illegal instruction (core dumped)

Maybe the bug occurs only with the r600 driver. I will try to rebuild the mesa package with debug info and post a new backtrace.
Created attachment 118575 [details] new backtrace with a debug version of mesa 11.0.2 new backtrace with a debug version of mesa 11.0.2
Running with LIBGL_ALWAYS_SOFTWARE removes any hardware specifics and glxgears/foo runs on the CPU alone. Suspecting a different bug - perhaps glibc 2.22/llvm 3.7 related ? I'm using same gcc + glibc 2.21-4 + llvm 3.6.2-3. Might want to open another bug report for that one.
(In reply to Emil Velikov from comment #7)
> Running with LIBGL_ALWAYS_SOFTWARE removes any hardware specifics and
> glxgears/foo runs on the CPU alone. Suspecting a different bug - perhaps
> glibc 2.22/llvm 3.7 related ? I'm using same gcc + glibc 2.21-4 + llvm
> 3.6.2-3.
>
> Might want to open another bug report for that one.

I use glibc 2.22-3 (not 2.21-4) and llvm 3.7.0-4; both are in the Arch Linux stable repositories.

"export LIBGL_ALWAYS_SOFTWARE=1" fails with every OpenGL program I tried: glxgears, blender and 3D games all crash with the error "Illegal instruction (core dumped)" when "LIBGL_ALWAYS_SOFTWARE" is set to "1".
With apitrace I may have found something interesting when I replay my trace file for the bug. It seems that the call "glRasterPos2i(0,0)" is the last OpenGL call before the crash. If I try to run this call in apitrace (with a double click) I see this error message:

caught an unhandled exception
glretrace+0x23c99c
/usr/lib/libpthread.so.0+0x10d5f
?+0x7f2f764655eb
apitrace: info: taking default action for signal 4

I followed my instinct: I found some C++ source code on Google that uses glRasterPos2i() and compiled it in order to test that function:

#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <GL/glut.h>

using namespace std;

void display()
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    // Draw a red square.
    glColor3ub(255,0,0);
    glPushMatrix();
    glScalef(5,5,5);
    glBegin(GL_QUADS);
    glVertex2f(-1,-1);
    glVertex2f(1,-1);
    glVertex2f(1,1);
    glVertex2f(-1,1);
    glEnd();
    glPopMatrix();

    // Draw green bitmap text at the origin.
    glColor3ub(0,255,0);  // A
    glRasterPos2i(0,0);   // B
    string tmp( "wha-hey!" );
    for( size_t i = 0; i < tmp.size(); ++i )
    {
        glutBitmapCharacter(GLUT_BITMAP_HELVETICA_18, tmp[i]);
    }

    glutSwapBuffers();
}

void reshape(int w, int h)
{
    glViewport(0, 0, w, h);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    double aspect_ratio = (double)w / (double)h;
    glOrtho(-10*aspect_ratio, 10*aspect_ratio, -10, 10, -1, 1);
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DEPTH | GLUT_DOUBLE);

    glutInitWindowSize(800,600);
    glutCreateWindow("Text");

    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutMainLoop();
    return EXIT_SUCCESS;
}

If I run this test code I get the same error message, "illegal instruction". If I comment out the line "glRasterPos2i(0,0);" there is no bug: the test program runs and I see a screen with a red square.

I am a newbie in OpenGL programming, but it seems that mesa 11.0.2 has a problem with the "glRasterPos2i()" function when the r600 gallium driver is used (AMD Radeon HD4650 PCIe graphics card).
Some very quick search turned up this bug, which looks like it could be a winner:
https://bugs.archlinux.org/task/43010

Seems some libpthread implementations will try to call hw lock elision instructions on cpus where lock elision isn't supported, but _only_ if unlock is called on an already unlocked lock (so the calling code is buggy, which is why it's a won't fix in libpthread).

Debug information for libpthread would be nice, or you could disassemble in gdb when it crashes to confirm it's actually calling that.

I have no idea where that bug could be, but it should be worthwhile to track it down; normally you don't get the luxury of a straight neat crash when the locking is wrong...
In fact the real culprit is llvm-3.7.0-4 and llvm-libs-3.7.0-4.

If I downgrade llvm and llvm-libs to version 3.6.2-4 and rebuild the mesa 11.0.2 packages against llvm 3.6.2, then all is OK: no bugs, flightgear does not crash, and I can also run with LIBGL_ALWAYS_SOFTWARE=1 without problems.

So there is something wrong in llvm 3.7.0. Should I create a bug report on the llvm website?

It looks like a potentially huge bug once llvm 3.7.0 hits other Linux distros like Ubuntu/Fedora.
I created a bug report on the llvm website:

https://llvm.org/bugs/show_bug.cgi?id=25021

I hope they will find a fix for llvm 3.7.0.
(In reply to Barto from comment #11)
> In fact the real culprit is llvm-3.7.0-4 and llvm-libs-3.7.0-4.
>
> If I downgrade llvm and llvm-libs to version 3.6.2-4 and rebuild the mesa
> 11.0.2 packages against llvm 3.6.2, then all is OK: no bugs, flightgear
> does not crash, and I can also run with LIBGL_ALWAYS_SOFTWARE=1 without
> problems.
>
> So there is something wrong in llvm 3.7.0. Should I create a bug report on
> the llvm website?
>
> It looks like a potentially huge bug once llvm 3.7.0 hits other Linux
> distros like Ubuntu/Fedora.

Does not necessarily mean it is a bug in llvm, maybe we're initializing it wrong or something (there were some recent patches to fix some threading issues in gallivm). Not my area of expertise, though...
(In reply to Roland Scheidegger from comment #13)
> Does not necessarily mean it is a bug in llvm, maybe we're initializing it
> wrong or something (there were some recent patches to fix some threading
> issues in gallivm). Not my area of expertise, though...

I tried the git version of mesa (master branch) --> same crash with llvm 3.7.

What I am sure of is that there is no problem with llvm 3.6.2 and mesa 11.x (or even mesa 10.6 + llvm 3.6.2).

Here are the release notes of llvm 3.7:

http://llvm.org/releases/3.7.0/docs/ReleaseNotes.html

I still have no answer from the llvm website about my bug report. I will try to find the faulty commit in llvm 3.7.x.
I didn't manage to find the faulty commit in llvm 3.8; the bisecting process is very slow and difficult (I have a slow PC and there are a lot of svn revisions to test).

(In reply to Roland Scheidegger from comment #13)
> Does not necessarily mean it is a bug in llvm, maybe we're initializing it
> wrong or something (there were some recent patches to fix some threading
> issues in gallivm). Not my area of expertise, though...

In fact I think Roland is right: the problem could be in mesa 11.0.2 and not in LLVM 3.8, if mesa 11.0.2 uses some LLVM functions incorrectly.

It would be nice if someone with good knowledge of the r600 driver could investigate. We have a gdb backtrace, we know that the problem occurs when a program uses the OpenGL function "glRasterPos2i()", and the problem also occurs in 100% software mode (export LIBGL_ALWAYS_SOFTWARE=1).

And of course the problem occurs only with LLVM 3.8 libs (and at least with the r600 driver); there is no problem with LLVM 3.6.2.

It would be great if this bug could be solved before LLVM 3.8 hits other Linux distros (like Ubuntu).
(In reply to Barto from comment #15)
> And of course the problem occurs only with LLVM 3.8 libs (and at least with
> the r600 driver); there is no problem with LLVM 3.6.2.
>
> It would be great if this bug could be solved before LLVM 3.8 hits other
> Linux distros (like Ubuntu).

Correction: of course the problem also occurs with LLVM 3.7 (not only LLVM 3.8).
The r600 driver does not use llvm.
(In reply to Alex Deucher from comment #17)
> The r600 driver does not use llvm.

Here is what I see with ldd:

$ ldd /usr/lib/xorg/modules/dri/r600_dri.so
linux-vdso.so.1 (0x00007ffe8a9f7000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007efe0af45000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007efe0ad41000)
libexpat.so.1 => /usr/lib/libexpat.so.1 (0x00007efe0ab17000)
libdrm_nouveau.so.2 => /usr/lib/libdrm_nouveau.so.2 (0x00007efe0a90f000)
libdrm_radeon.so.1 => /usr/lib/libdrm_radeon.so.1 (0x00007efe0a703000)
libdrm_amdgpu.so.1 => /usr/lib/libdrm_amdgpu.so.1 (0x00007efe0a4fb000)
libdrm.so.2 => /usr/lib/libdrm.so.2 (0x00007efe0a2eb000)
libelf.so.1 => /usr/lib/libelf.so.1 (0x00007efe0a0d5000)
libLLVM.so.3.7 => /usr/lib/libLLVM.so.3.7 (0x00007efe079c3000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007efe07640000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007efe07342000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007efe06f9e000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007efe06d87000)
/usr/lib64/ld-linux-x86-64.so.2 (0x00005604503b7000)
libffi.so.6 => /usr/lib/../lib/libffi.so.6 (0x00007efe06b7e000)
libedit.so.0 => /usr/lib/../lib/libedit.so.0 (0x00007efe06941000)
libncursesw.so.6 => /usr/lib/../lib/libncursesw.so.6 (0x00007efe066d4000)
libz.so.1 => /usr/lib/../lib/libz.so.1 (0x00007efe064be000)

libLLVM.so.3.7 seems to be used by the r600_dri.so file.

$ pacman -Qo /usr/lib/xorg/modules/dri/r600_dri.so
/usr/lib/xorg/modules/dri/r600_dri.so is owned by mesa 11.0.2-1

In glxinfo I can see this:

OpenGL renderer string: Gallium 0.4 on llvmpipe (LLVM 3.7, 128 bits)

So which driver is really used by mesa 11.0.2 with my Radeon HD4650 PCIe?
Created attachment 118783 [details] strace of "tunnel" ( from mesa-demos ) tunnel ( from mesa-demos ) always crashes at start ( "illegal instruction" error ) when I use mesa 11.0.2 compiled with LLVM 3.7, here a strace file
Created attachment 118786 [details] backtrace of tunnel the gdb backtrace of tunnel ( from mesa-demos ) with mesa 11.0.2/LLVM 3.7
(In reply to Barto from comment #19)
> Created attachment 118783 [details]
> strace of "tunnel" ( from mesa-demos )
>
> tunnel ( from mesa-demos ) always crashes at start ( "illegal instruction"
> error ) when I use mesa 11.0.2 compiled with LLVM 3.7,
>
> here a strace file

When you say "compiled with", do you mean "compiled by" or "linked to"? If you're using clang, try using gcc. People come into #nouveau every so often with totally random crashes in various packages, apparently related to using clang.

r600g only uses llvm directly for opencl and a few extremely rare things like GL_SELECT.
Hello Ilia Mirkin,

(In reply to Ilia Mirkin from comment #21)
> When you say "compiled with", do you mean "compiled by" or "linked to"? If
> you're using clang, try using gcc. People come into #nouveau every so often
> with totally random crashes in various packages, apparently related to using
> clang.
>
> r600g only uses llvm directly for opencl and a few extremely rare things
> like GL_SELECT.

I don't use clang, I use gcc 5.2.0. mesa 11.0.2 is compiled with gcc and linked to the libLLVM libs.

I use the official Arch Linux mesa package. Here you can see exactly how it is built:

https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa

They use GCC.

I also tried compiling mesa myself (11.0.2 and the git version) and I get the same result as with the official Arch Linux mesa package:

- OK when mesa 11.0.2 is linked to llvm libs 3.6.2: no bugs in OpenGL programs

- bug when mesa 11.0.2 is linked to llvm libs 3.7.0 (crash in flightgear 3.4 and in the tunnel test program from mesa-demos, and it's impossible to use the software mode "LIBGL_ALWAYS_SOFTWARE=1")
I notice that the "configure" file in the mesa 11.0.3 source code is not compatible with the LLVM 3.7.0 libs.

The configure file tries to detect the LLVM libs. That works with LLVM 3.6.2 (and below) but not with LLVM 3.7.0, because the name of the .so file changed in 3.7.0 ("/usr/lib/libLLVM.so" for LLVM 3.7.0, "/usr/lib/libLLVM-3.6.2.so" for LLVM 3.6.2).

If a user with LLVM 3.7.0 tries to compile mesa 11.0.3 (with gcc), the configure file stops with "error :Could not find llvm shared libraries".

Arch Linux developers have found a workaround with this patch, in order to link mesa 11.0.3 with LLVM 3.7.0 libs:

# Fix detection of libLLVM when built with CMake
sed -i 's/LLVM_SO_NAME=.*/LLVM_SO_NAME=LLVM/' configure

https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa

My question: have the mesa developers really tested mesa 11.0.3 linked with LLVM 3.7.0 (especially with the r600 driver)?

It could explain my bug if some changes in LLVM 3.7.0 (API, functions) require changes in the mesa source code, for example if the llvm libs are now initialized incorrectly.
(In reply to Barto from comment #23)
> Arch Linux developers have found a workaround with this patch, in order to
> link mesa 11.0.3 with LLVM 3.7.0 libs:
>
> # Fix detection of libLLVM when built with CMake
> sed -i 's/LLVM_SO_NAME=.*/LLVM_SO_NAME=LLVM/' configure

Some people indeed use llvm 3.7 with mesa (quite a lot I would say).

You'll notice right there ("when built with CMake") this patch is not always needed with llvm 3.7. The different name is really FUBAR but is beyond mesa's control as it only depends on how llvm is built... Albeit I don't know what the resolution to that llvm problem was if there's any yet.
(In reply to Roland Scheidegger from comment #24)
> You'll notice right there ("when built with CMake") this patch is not always
> needed with llvm 3.7. The different name is really FUBAR but is beyond
> mesa's control as it only depends on how llvm is built... Albeit I don't
> know what the resolution to that llvm problem was if there's any yet.

CMake might be the culprit if the generation of the libLLVM.so file (version 3.7.0) has problems. I found two interesting links about the "CMake way" of building LLVM 3.7.0:

https://llvm.org/bugs/show_bug.cgi?id=23649

http://lists.freedesktop.org/archives/mesa-dev/2015-September/095696.html

I also found that the Arch Linux developers built LLVM 3.6.2 the classic way (./configure, make, make install), and then switched to the CMake build method for version 3.7.0.

The LLVM 3.6.2 lib works without problems with mesa 11.0.3. The LLVM 3.7.0 lib triggers a crash when mesa 11.0.3 tries to use it (when the r600 driver is used with an AMD Radeon HD4650 PCIe card).

It's just a supposition: my bug could be a consequence of the switch to the CMake build method, if something went wrong in the CMake configuration files of LLVM 3.7.0.

To be sure, I am currently trying to compile LLVM 3.7.0 the classic way (./configure), by reverting this patch in order to allow it:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150629/284970.html

I will see if the bug is still there when libLLVM.so.3.7 is built with ./configure and not CMake.
(In reply to Barto from comment #25)
> To be sure, I am currently trying to compile LLVM 3.7.0 the classic way
> (./configure), by reverting this patch in order to allow it:
>
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150629/284970.html
>
> I will see if the bug is still there when libLLVM.so.3.7 is built with
> ./configure and not CMake.

I tried: the bug is still there, even with an autoconf build.

So the bug is elsewhere, maybe in the mesa source code (bad initialization of the llvm 3.7.0 lib) or in LLVM 3.7.0. It could also be in glibc 2.22-3, but I doubt it.

Here is a new backtrace of the tunnel program (from mesa-demos), this time with a debug version of glibc 2.22-3:

(gdb) thread apply all bt full

Thread 2 (Thread 0x7fffee4c6700 (LWP 11292)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1 0x00007ffff22fe04a in ?? () from /usr/lib/xorg/modules/dri/r600_dri.so
No symbol table info available.
#2 0x00007ffff22fd787 in ?? () from /usr/lib/xorg/modules/dri/r600_dri.so
No symbol table info available.
#3 0x00007ffff3d74464 in start_thread (arg=0x7fffee4c6700) at pthread_create.c:334
__res = <optimized out>
pd = 0x7fffee4c6700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737191372544, -3511726595263618993, 0, 140737488345551, 3, 140737488348064, 3511694852238942287, 3511735595139276879}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#4 0x00007ffff704d13d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 1 (Thread 0x7ffff7f9a740 (LWP 11286)):
#0 0x00007fffedaa55eb in ?? ()
No symbol table info available.
#1 0x000000000076a320 in ?? ()
No symbol table info available.
#2 0x00007fffffffd630 in ?? ()
No symbol table info available.
#3 0x0000000000000007 in ?? ()
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.

It would be interesting to put a breakpoint in the mesa 11.0.3 source code, at a location where LLVM 3.7.0 functions are loaded or used. My goal is to know whether the bug occurs before or after mesa tries to use LLVM functions, and whether it's a problem related to glibc or the gcc libs. But I don't know where I should put this breakpoint.

Another test would be to write a simple test program, for example a test.c that loads the LLVM 3.7.0 lib and tries to run a simple LLVM function. The idea is to check whether the LLVM 3.7.0 lib can be loaded and used without a crash.
I made an interesting discovery: the bug also occurs in a virtual machine (qemu i686, guest OS: Arch Linux i686).

In this virtual machine the r600 driver is not used; the swrast_dri.so file is used instead (100% software emulation, no 3D acceleration).

In this virtual machine all OpenGL programs (glxgears for example) crash with the error "illegal instruction".

This qemu i686 virtual machine runs on my PC (host OS: Arch Linux 64-bit, CPU: Pentium dual core E6800).

glxinfo for this qemu VM:

name of display: :0
display: :0 screen: 0
direct rendering: Yes
server glx vendor string: SGI
server glx version string: 1.4
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: Gallium 0.4 on llvmpipe (LLVM 3.7, 128 bits)
OpenGL version string: 3.0 Mesa 11.0.3
OpenGL shading language version string: 1.30

Xorg log:

[ 13.255] (WW) Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
[ 13.533] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[ 13.962] (II) Loading /usr/lib/xorg/modules/drivers/vmware_drv.so
[ 14.948] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 14.978] (II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
[ 15.010] (II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
[ 15.060] (II) Loading /usr/lib/xorg/modules/libfbdevhw.so
[ 15.270] (II) Loading /usr/lib/xorg/modules/libvgahw.so
[ 15.281] (==) vmware(0): Using HW cursor
[ 15.282] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 15.367] (II) Loading /usr/lib/xorg/modules/libshadowfb.so
[ 13.962] (II) Loading /usr/lib/xorg/modules/drivers/vmware_drv.so
[ 14.948] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 14.978] (II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
[ 15.010] (II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
[ 15.053] (II) vmware: driver for VMware SVGA: vmware0405, vmware0710
[ 15.053] (II) FBDEV: driver for framebuffer: fbdev
[ 15.053] (II) VESA: driver for VESA chipsets: vesa

The mesa driver seems to be swrast_dri.so, and the backtrace is still the same:

Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0xb450eb40 (LWP 839)]
[New Thread 0xb3d0db40 (LWP 840)]

Program received signal SIGILL, Illegal instruction.
0xb7fd2091 in ?? ()

Thread 3 (Thread 0xb3d0db40 (LWP 840)):
#0 0xb7fdbbc8 in __kernel_vsyscall ()
No symbol table info available.
#1 0xb7a7da2b in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#2 0xb7c8de4d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libc.so.6
No symbol table info available.
#3 0xb757940a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#4 0xb7579275 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#5 0xb7a78315 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#6 0xb7c80e1e in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 2 (Thread 0xb450eb40 (LWP 839)):
#0 0xb7fdbbc8 in __kernel_vsyscall ()
No symbol table info available.
#1 0xb7a7da2b in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#2 0xb7c8de4d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libc.so.6
No symbol table info available.
#3 0xb757940a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#4 0xb7579275 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#5 0xb7a78315 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#6 0xb7c80e1e in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 1 (Thread 0xb7a5f700 (LWP 838)):
#0 0xb7fd2091 in ?? ()
No symbol table info available.
#1 0xb7367986 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#2 0xb7367d36 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#3 0xb729bc19 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#4 0xb72942e3 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#5 0xb72948c6 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#6 0xb7577813 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#7 0xb72820fd in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#8 0xb713d166 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#9 0xb7125f5a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#10 0xb6ff0600 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#11 0xb7004b40 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#12 0x08049fdb in ?? ()
No symbol table info available.
#13 0x080496ca in ?? ()
No symbol table info available.
#14 0xb7baf497 in __libc_start_main () from /usr/lib/libc.so.6
No symbol table info available.
#15 0x08049d0a in ?? ()
No symbol table info available.

What is your opinion about this new information?
Another discovery: in qemu I can set the type of CPU (pentium, pentium2, core2duo, SandyBridge and many more). You can see the list of CPUs with the command "qemu-i386 -cpu ?".

Until now I used the qemu option "-cpu host", which means that the CPU of the host (my Pentium dual core E6800) is the one emulated.

Then I decided to set a different CPU name in my qemu script:

-cpu core2duo -enable-kvm -machine type=pc,accel=kvm -smp 2

With this setting the bug disappears and all is OK in my virtual machine: glxgears and all OpenGL programs run without a crash, and the mesa llvmpipe driver doesn't crash.

After that I set yet another CPU in qemu:

-cpu Penryn -enable-kvm -machine type=pc,accel=kvm -smp 2 \

With the "Penryn" CPU the bug is back in my virtual machine, which means that the bug seems related to the type of CPU.

The llvm 3.7.0 lib may have a bug when it generates binary code: it fails on some CPUs. This problem doesn't exist with the llvm 3.6.2 lib.
As Roland said in comment #13, the bug could be in the mesa source code: an incorrect initialization of llvm 3.7.0. It worked with 3.6.2, but perhaps with 3.7.0 the same code is not enough, and more precise instructions are needed to target exactly the right CPU platform and produce correct CPU opcodes.

In dmesg I can see these messages when the SIGILL occurs:

[13766.649327] traps: tunnel[4095] trap invalid opcode ip:7f68ce8d7183 sp:7fff6c078700 error:0
[35876.638782] traps: llvmpipe-1[8766] trap invalid opcode ip:f77271bd sp:f3536db0 error:0
[35876.638785] traps: llvmpipe-0[8765] trap invalid opcode ip:f77271bd sp:f3d37db0 error:0
I found the cause of this bug: it's llvm 3.7.0. The llvm git commit that introduced the bug is:

cd83d5b5071f072882ad06cc4b904b2d27d1e54a

https://github.com/llvm-mirror/llvm/commit/cd83d5b5071f072882ad06cc4b904b2d27d1e54a

The problem is that llvm 3.7.0 treats my Pentium dual core as a "penryn". Penryn supports SSE4, but the Pentium dual core series does not, even though it is also CPU family 6 model 23. The faulty commit deleted a test for SSE4:

return HasSSE41 ? "penryn" : "core2";

The solution is simply to add this test back for CPU family 6 model 23. I created a patch that solves this bug:

--- a/lib/Support/Host.cpp 2015-10-14 07:13:52.381374679 +0200
+++ b/lib/Support/Host.cpp 2015-10-14 07:13:28.224708323 +0200
@@ -332,6 +332,8 @@
       // 17h. All processors are manufactured using the 45 nm process.
       //
       // 45nm: Penryn , Wolfdale, Yorkfield (XE)
+      // Not all Penryn processors support SSE 4.1 (such as the Pentium brand)
+      return HasSSE41 ? "penryn" : "core2";
     case 29: // Intel Xeon processor MP. All processors are manufactured using
       // the 45 nm process.
       return "penryn";

This patch has been sent to llvm's bugzilla; I hope they will accept it.
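To make the family/model reasoning above concrete, here is a minimal standalone sketch. It is not LLVM's Host.cpp: the names `CpuSignature`, `decodeSignature` and `cpuNameFor` are made up for illustration, and only the family 6 model 23 case from this thread is handled. The decoding follows the standard CPUID leaf 1 layout (base family in bits 8-11, base model in bits 4-7, extended model in bits 16-19, extended family in bits 20-27).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Family and model decoded from the CPUID leaf 1 EAX "signature".
struct CpuSignature {
    unsigned family;
    unsigned model;
};

// Decode family/model from CPUID.01H:EAX. The extended model bits only
// apply when the base family is 6 or 15; the extended family is only
// added when the base family is 15.
CpuSignature decodeSignature(std::uint32_t eax) {
    unsigned family = (eax >> 8) & 0xF;
    unsigned model = (eax >> 4) & 0xF;
    if (family == 0x6 || family == 0xF) {
        model |= ((eax >> 16) & 0xF) << 4;  // extended model
    }
    if (family == 0xF) {
        family += (eax >> 20) & 0xFF;       // extended family
    }
    return {family, model};
}

// Hypothetical helper (not LLVM's code): pick a CPU name for family 6,
// model 23, falling back to "core2" when SSE4.1 is missing.
std::string cpuNameFor(const CpuSignature &sig, bool hasSSE41) {
    assert(sig.family <= 0xF + 0xFF);  // largest value decodeSignature yields
    if (sig.family == 6 && sig.model == 23) {
        return hasSSE41 ? "penryn" : "core2";
    }
    return "generic";  // every other CPU is out of scope for this sketch
}
```

For example, the signature 0x1067A decodes to family 6, model 23 (0x17), so `cpuNameFor(decodeSignature(0x1067A), false)` gives "core2" while the same call with `true` gives "penryn".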
An llvm developer (Craig Topper) has answered:

https://llvm.org/bugs/show_bug.cgi?id=25021#c8

He said that removing the SSE4 test for penryn CPUs (including the Pentium Dual-Core, which doesn't support SSE4) was intentional.

He advises mesa developers to use the llvm function "getHostCPUFeatures()" to check whether the CPU supports SSE4 before doing an llvm operation (such as machine code generation).

I checked the mesa source code and it seems that the mesa developers don't use "getHostCPUFeatures()".

I am not a specialist in the llvm API, but there seems to be a change since llvm 3.7.0: projects like mesa should now always check whether the host CPU has features like SSE4, because without such a check llvm applies the generic CPU settings defined in the file /lib/Target/X86.td. This file contains feature definitions for each CPU:

// Intel Core 2 Solo/Duo.
def : ProcessorModel<"core2", SandyBridgeModel, [FeatureSSSE3, FeatureCMPXCHG16B, FeatureSlowBTMem]>;
def : ProcessorModel<"penryn", SandyBridgeModel, [FeatureSSE41, FeatureCMPXCHG16B, FeatureSlowBTMem]>;

We see here that by default llvm treats every penryn CPU as "SSE4 ready", which triggers a bug on a Pentium Dual-Core (cpu family 6 model 23) if the application using the llvm libraries performs no sanity check such as the SSE4 test.
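The mismatch described above can be illustrated with a small self-contained sketch. It does not call into LLVM: a plain `std::map` stands in for whatever a host feature query would report, the table only contains the two CPU names quoted from X86.td, and the feature strings ("ssse3", "sse4.1", "cx16") are the attribute spellings that appear elsewhere in this thread. The function name `unsupportedImplied` is hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns the features that a CPU name implies but the host does not report.
// A feature missing from `host` is treated as unsupported.
std::vector<std::string> unsupportedImplied(
        const std::string& cpuName,
        const std::map<std::string, bool>& host) {
    // Illustrative subset of the X86.td definitions quoted above.
    static const std::map<std::string, std::set<std::string>> implied = {
        {"core2",  {"ssse3",  "cx16"}},
        {"penryn", {"sse4.1", "cx16"}},
    };

    std::vector<std::string> missing;
    auto it = implied.find(cpuName);
    if (it == implied.end())
        return missing; // unknown name: nothing to compare against

    for (const auto& feat : it->second) {
        auto h = host.find(feat);
        if (h == host.end() || !h->second)
            missing.push_back(feat);
    }
    return missing;
}
```

For a host that reports cx16 and ssse3 but not sse4.1, the name "penryn" yields one unsupported feature (sse4.1) while "core2" yields none, which is the situation on the Pentium Dual-Core.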
(In reply to Barto from comment #31)
> An llvm developer (Craig Topper) has answered:
>
> https://llvm.org/bugs/show_bug.cgi?id=25021#c8
>
> He said that removing the SSE4 test for penryn CPUs (including the Pentium
> Dual-Core, which doesn't support SSE4) was intentional.
>
> He advises mesa developers to use the llvm function "getHostCPUFeatures()"
> to check whether the CPU supports SSE4 before doing an llvm operation
> (such as machine code generation).
>
> I checked the mesa source code and it seems that the mesa developers don't
> use "getHostCPUFeatures()".

We do our own feature detection, though so far we didn't really tell llvm which features it's allowed to use, as we were relying on this getting detected automatically.

I believe the patch posted by Jose here,
http://lists.freedesktop.org/archives/mesa-dev/2015-October/097948.html
should help.

(Though this bug appears to have nothing to do at all with the originally posted issue.)
(In reply to Roland Scheidegger from comment #32)
> I believe the patch posted by Jose here,
> http://lists.freedesktop.org/archives/mesa-dev/2015-October/097948.html
> should help.
>

I tested the patch posted by Jose; it doesn't solve the bug, and the crash is still there ("illegal instruction").

We need to be sure that mesa does not pass the "SSE4" argument to the llvm compiler if the CPU doesn't support SSE4. For llvm 3.7.0 my cpu name is "penryn", and SSE4 seems to be enabled by default for penryn in llvm's default settings. That's why mesa must perform a check to avoid this crash (illegal opcode) when a Pentium Dual-Core is used.

I don't know exactly what is wrong in the mesa source code: whether the SSE4 check is really done, and whether mesa passes the correct cpu feature arguments (SSEx, MMX, AVX...) to llvm.

For now the workaround I found is to patch LLVM 3.7.0 to re-add the "SSE4 test". After this patch my CPU name is "core2", which was the default behaviour in previous versions of LLVM (such as 3.6.2):

--- a/lib/Support/Host.cpp 2015-10-14 07:13:52.381374679 +0200
+++ b/lib/Support/Host.cpp 2015-10-14 07:13:28.224708323 +0200
@@ -332,6 +332,8 @@
 // 17h. All processors are manufactured using the 45 nm process.
 //
 // 45nm: Penryn , Wolfdale, Yorkfield (XE)
+ // Not all Penryn processors support SSE 4.1 (such as the Pentium brand)
+ return HasSSE41 ? "penryn" : "core2";
 case 29: // Intel Xeon processor MP. All processors are manufactured using
 // the 45 nm process.
 return "penryn";
Craig Topper has made a suggestion that solves the problem: the idea is to "remove" the unsupported CPU features by adding "-sse4.1" to the MAttrs object when "util_cpu_caps.has_sse4_1 == false". If I add this to Jose's patch, the bug is solved:

+    if (!util_cpu_caps.has_sse4_1) {
+#if HAVE_LLVM >= 0x0304
+        MAttrs.push_back("-sse4.1");
+#else
+        MAttrs.push_back("-sse41");
+#endif
+    }

This logic is not really natural for a developer who wants to use the llvm library. Such a developer would expect llvm never to use an unsupported cpu feature as long as only supported features are passed to the compiler. Instead, llvm seems to use SSE4.1 on its own even if the developer never explicitly added "+sse4.1" in the source code.

We have this problem because llvm 3.7.0 treats the Pentium Dual-Core as a "penryn" cpu; "penryn" supports SSE4 but the Pentium Dual-Core does not. In my view llvm should be stricter when it associates a cpu with a cpu name: a cpu name should reflect the cpu features exactly. Maybe a better solution would be to create a new cpu name in the llvm source code that targets only cpu family 6 model 23, such as "dualcore", to avoid this SSE4 problem. But I am not an llvm specialist.
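As a rough sketch of the idea (not the actual mesa patch), the attribute list can state SSE4.1 explicitly in both directions, so the result no longer depends on what the CPU name implies. `CpuCaps` is a hypothetical stand-in for `util_cpu_caps`, `buildMAttrs` is a made-up name, and the LLVM version is passed as a parameter instead of the `HAVE_LLVM` macro so the function can be exercised on its own.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the relevant field of mesa's util_cpu_caps.
struct CpuCaps {
    bool has_sse4_1;
};

// Builds the attribute strings for SSE4.1, always with an explicit sign.
// llvmVersion uses the same 0xMMmm encoding as HAVE_LLVM in the snippet above.
std::vector<std::string> buildMAttrs(const CpuCaps& caps, int llvmVersion) {
    std::vector<std::string> mattrs;
    // Per the snippet above, the feature is spelled "sse4.1" from 3.4 on
    // and "sse41" before that.
    const std::string sse41 = (llvmVersion >= 0x0304) ? "sse4.1" : "sse41";
    mattrs.push_back((caps.has_sse4_1 ? "+" : "-") + sse41);
    return mattrs;
}
```

On the Pentium Dual-Core with llvm 3.7 this produces the single entry "-sse4.1", which is the negative flag Craig Topper suggested.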
Created attachment 119117 [details] patch that solves the bug by explicitly removing SSE4 when the CPU doesn't support SSE4

This is the modified version of Jose's patch. It solves the bug by explicitly removing SSE4 when the CPU doesn't support SSE4. The new lines:

+    if (!util_cpu_caps.has_sse4_1) {
+#if HAVE_LLVM >= 0x0304
+        MAttrs.push_back("-sse4.1");
+#else
+        MAttrs.push_back("-sse41");
+#endif
+    }
I don't think llvm's behavior makes sense. We got the cpu name from llvm; having to manually list the cpu features it CAN'T use when we pass that same name back is imho crazy. I've updated the llvm bug accordingly.
(In reply to Roland Scheidegger from comment #36)
> I don't think llvm's behavior makes sense. We got the cpu name from llvm,
> that we have to manually list cpu features which it CAN'T use when just
> using that name then is imho crazy. I've updated the llvm bug accordingly.

The fact is that the cpu name is ambiguous, so whether LLVM takes the "usually" supported features or the "minimally" supported features is really a matter of convention.

We should set negative flags where appropriate.

My concern is whether passing "-sse4.1" to a non-Intel CPU will cause problems. A quick check with altivec shows that's the case:

  '-altivec' is not a recognized feature for this target (ignoring feature)

I'll attach a patch that should fix this.
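One possible shape for such a guard, sketched under assumptions: feature flags are only emitted for the architecture they belong to, so an x86 build never passes "-altivec" and a PowerPC build never passes "-sse4.1". This is not the attached patch; the `Arch` enum, `CpuCaps` and `featureFlags` are hypothetical names, and real code would more likely select the branch at compile time with architecture macros.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration only.
enum class Arch { X86, PPC };

struct CpuCaps {
    bool has_sse4_1;
    bool has_altivec;
};

// Emits explicit +/- flags, but only for features the target architecture
// knows about, to avoid "not a recognized feature" warnings.
std::vector<std::string> featureFlags(Arch arch, const CpuCaps& caps) {
    std::vector<std::string> mattrs;
    switch (arch) {
    case Arch::X86:
        mattrs.push_back(caps.has_sse4_1 ? "+sse4.1" : "-sse4.1");
        break;
    case Arch::PPC:
        mattrs.push_back(caps.has_altivec ? "+altivec" : "-altivec");
        break;
    }
    return mattrs;
}
```

The design choice here is to key the flag list on the target architecture first and on the detected capabilities second, so a negative flag is only ever passed to a backend that recognizes it.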
Created attachment 119141 [details] [review] llvm-mattrs.patch Comprehensive fix.
Thanks Jose for this patch.

Just for the record: in 2013 someone already opened a bug report in llvm's bugzilla about the same problem (his Pentium Dual-Core was treated as penryn instead of core2 by llvm, which triggered a bug with SSE4.1):

https://llvm.org/bugs/show_bug.cgi?id=16721

At that time Benjamin Kramer fixed the problem by adding the "SSE4.1 test" in /lib/Support/Host.cpp:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130729/182469.html

Unfortunately this fix was deleted by Craig Topper with his commit cd83d5b5071f072882ad06cc4b904b2d27d1e54a in March 2015.

gcc seems to handle this problem differently: my Pentium Dual-Core is identified as "core2" when I use "-march=native":

$ gcc -march=native -Q --help=target | grep march
  -march=                     core2

$ gcc -march=native -Q --help=target | grep sse
  -mno-sse4                   [enabled]
  -msse                       [enabled]
  -msse2                      [enabled]
  -msse2avx                   [disabled]
  -msse3                      [enabled]
  -msse4                      [disabled]
  -msse4.1                    [disabled]
  -msse4.2                    [disabled]
  -msse4a                     [disabled]
  -msse5
  -msseregparm                [disabled]
  -mssse3                     [enabled]

But with clang 3.7.0 the "-march=native" argument leads to "cpu penryn", possibly with sse4.1 in use even though the CPU is a Pentium Dual-Core that doesn't support sse4.1:

$ clang -v -E -march=native -
"/usr/bin/clang-3.7" -cc1 -triple x86_64-unknown-linux-gnu -E -disable-free -disable-llvm-verifier -main-file-name - -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu penryn -target-feature -sse4a -target-feature -avx512bw -target-feature +cx16 -target-feature -tbm -target-feature -adx -target-feature -fma4 -target-feature -avx512vl -target-feature -prfchw -target-feature -bmi2 -target-feature -avx512pf -target-feature -fsgsbase -target-feature -avx -target-feature -avx512cd -target-feature -rtm -target-feature -popcnt -target-feature -fma -target-feature -bmi -target-feature -aes -target-feature -rdrnd -target-feature -sse4.1 -target-feature -sse4.2 -target-feature -avx2 -target-feature -avx512er -target-feature +sse -target-feature -lzcnt -target-feature -pclmul -target-feature -avx512f -target-feature -f16c -target-feature +ssse3 -target-feature +mmx -target-feature +cmov
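The long clang invocation above is hard to read by eye, so here is a small sketch that extracts the `-target-feature` pairs from such a command line into a map of feature name to enabled/disabled. It is only a convenience for inspecting the output; `parseTargetFeatures` is a hypothetical helper and it assumes the arguments are separated by whitespace, as in the output shown.

```cpp
#include <map>
#include <sstream>
#include <string>

// Collects every "-target-feature +name" / "-target-feature -name" pair
// from a whitespace-separated command line. true means enabled ('+').
std::map<std::string, bool> parseTargetFeatures(const std::string& cmdline) {
    std::map<std::string, bool> features;
    std::istringstream in(cmdline);
    std::string tok;
    while (in >> tok) {
        if (tok != "-target-feature")
            continue;
        std::string feat;
        if (!(in >> feat))
            break;                      // option given without a value
        if (feat.size() < 2)
            continue;                   // nothing after the sign
        if (feat[0] == '+')
            features[feat.substr(1)] = true;
        else if (feat[0] == '-')
            features[feat.substr(1)] = false;
    }
    return features;
}
```

Applied to the invocation above, the map has an entry "sse4.1" set to false and an entry "ssse3" set to true, next to the `-target-cpu penryn` that clang selected.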
I pushed the patch, and listed all the referred LLVM PRs in a comment for future reference.