Bug 40423 - Firefox GLX test crashes with Mesa 7.12-devel, Gallium 0.4 on AMD JUNIPER
Summary: Firefox GLX test crashes with Mesa 7.12-devel, Gallium 0.4 on AMD JUNIPER
Status: RESOLVED NOTOURBUG
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600 (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Default DRI bug account
QA Contact:
URL: https://crash-stats.mozilla.com/repor...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-08-27 11:21 UTC by Benoit Jacob
Modified: 2011-12-29 09:16 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Stand-alone version of Mozilla's glxtext (9.75 KB, text/plain)
2011-08-29 11:31 UTC, Ian Romanick
Details
Stand-alone version of Mozilla's glxtext, v2 (9.58 KB, text/x-c++src)
2011-08-30 14:00 UTC, Benoit Jacob
Details
Stand-alone version of Mozilla's glxtext, v3 (9.58 KB, text/x-c++src)
2011-08-30 14:55 UTC, Benoit Jacob
Details

Description Benoit Jacob 2011-08-27 11:21:33 UTC
Just got this Firefox crash report:
https://crash-stats.mozilla.com/report/index/4a1da9e3-fef3-4ce6-be03-ce07d2110827

The Firefox crash itself is not relevant to Mesa/ATI, but this report also says (look at App Notes) that the little GLX test process that Firefox runs at startup also crashed:

App Notes 

GLXtest process failed (received signal 11):
VENDOR
  X.Org
RENDERER
  Gallium 0.4 on AMD JUNIPER
VERSION
  2.1 Mesa 7.12-devel (git-5379a70)
TFP
  TRUE

You can find the source code of this test here:
http://hg.mozilla.org/mozilla-central/file/d0700ba932b4/toolkit/xre/glxtest.cpp#l107

It's even simpler than 'glxinfo' but differs in that it uses a GLX pixmap instead of a XWindow to back the GL context.

As the 'App Notes' say, this process crashed on signal 11 (segmentation fault)

In the crash report, by clicking the Modules tab you can get version info for the libraries loaded in the Firefox process. For example:
   libX11.so.6.3.0
   libGL.so.1.2

Unfortunately I don't have any more information about it: no stack, no more device information.

Note that Firefox itself is "fine", the present bug is only making the GLX test process crash, Firefox notices that and as a result disables GL-dependent features. But it would be nice to not have to disable these features ;-)
Comment 1 Benoit Jacob 2011-08-27 11:22:53 UTC
Notice that besides looking at the source code (above link) you can also try this yourself just by running Firefox. Any version >= 6 will do, but the present crash report is on Nightly (9):

http://nightly.mozilla.org/
Comment 2 Benoit Jacob 2011-08-27 11:24:27 UTC
... to check this in Firefox, go to the about:support page and scroll down to Graphics. If the GLX test process crashed, this will be explicitly reported there, like "GLXtest process failed (received signal 11)"
Comment 3 Michel Dänzer 2011-08-29 11:04:17 UTC
(In reply to comment #3)
> You can find the source code of this test here:
> http://hg.mozilla.org/mozilla-central/file/d0700ba932b4/toolkit/xre/glxtest.cpp#l107

It has at least one obvious flaw: dlsym() can only be used to retrieve glXGetProcAddress(ARB) and the symbols defined by the Linux OpenGL ABI (basically OpenGL 1.2 + GL_ARB_multitexture). All other GL(X) symbols need to be retrieved with glXGetProcAddress(ARB).

I didn't look further than that.


> Unfortunately I don't have any more information about it: no stack, no more
> device information.

Even assuming the crash wasn't due to a test bug, it'll be hard to do anything about it without that information.
Comment 4 Ian Romanick 2011-08-29 11:31:07 UTC
(In reply to comment #3)
> (In reply to comment #3)
> > You can find the source code of this test here:
> > http://hg.mozilla.org/mozilla-central/file/d0700ba932b4/toolkit/xre/glxtest.cpp#l107
> 
> It has at least one obvious flaw: dlsym() can only be used to retrieve
> glXGetProcAddress(ARB) and the symbols defined by the Linux OpenGL ABI
> (basically OpenGL 1.2 + GL_ARB_multitexture). All other GL(X) symbols need to
> be retrieved with glXGetProcAddress(ARB).

That is kind of odd, but I don't think it's the problem.  The only GL symbol that they use dlsym on is glGetString.  The library has to statically export that. :)  glXQueryExtension, glXQueryVersion, glXCreatePixmap, glXMakeCurrent, glXDestroyPixmap, and glXDestroyContext are in the same boat.

Technically speaking, glXChooseFBConfig, glXGetVisualFromFBConfig, and glXCreateNewContext should be retrieved via glXGetProcAddress.  If dlsym returns non-NULL, it seems unlikely to be a problem on our stack.  Since is already done for glXBindTexImageEXT, it is a good idea to use glXGetProcAddress for these functions as well.

> I didn't look further than that.
> 
> 
> > Unfortunately I don't have any more information about it: no stack, no more
> > device information.
> 
> Even assuming the crash wasn't due to a test bug, it'll be hard to do anything
> about it without that information.

I hacked up a stand-alone version, but I wasn't able to reproduce the crash on the Intel drivers.
Comment 5 Ian Romanick 2011-08-29 11:31:41 UTC
Created attachment 50665 [details]
Stand-alone version of Mozilla's glxtext
Comment 6 Jure Repinc 2011-08-29 15:43:47 UTC
Looks like this crash report came from me. With Nightly from today and Mesa 7.12-devel (git-110f846) I still get "GLXtest process failed (received signal 11)" in about:support. But if I download the glxtest attachment from Ian and compile it with "g++ -ldl -lX11 glxtest.cpp -o glxtest" I don't get any crash, it exits normally.
Oh, and the hardware is AMD Radeon HD 5750
Comment 7 Benoit Jacob 2011-08-30 13:55:01 UTC
Hi Jure :-)

Thanks Ian and Michel, will use glXGetProcAddress.
Comment 8 Benoit Jacob 2011-08-30 13:56:59 UTC
(In reply to comment #6)
> Looks like this crash report came from me. With Nightly from today and Mesa
> 7.12-devel (git-110f846) I still get "GLXtest process failed (received signal
> 11)" in about:support. But if I download the glxtest attachment from Ian and
> compile it with "g++ -ldl -lX11 glxtest.cpp -o glxtest" I don't get any crash,
> it exits normally.

That's really weird isn't it. Can you run the standalone glxtest, compiled with -g3, in valgrind --tool=memcheck so we see if it's hitting uninitialized memory?
Comment 9 Benoit Jacob 2011-08-30 14:00:11 UTC
Created attachment 50707 [details]
Stand-alone version of Mozilla's glxtext, v2

The original version used an uninitialized value for the file descriptor it was writing to. This version now uses standard output, so one can see what information glxtest is sending to Firefox.
Comment 10 Benoit Jacob 2011-08-30 14:12:23 UTC
(In reply to comment #9)
> Created an attachment (id=50707) [details]
> Stand-alone version of Mozilla's glxtext, v2
> 
> The original version used an uninitialized value

er, sorry, it used -1, but i still guess that using stdout is better.
Comment 11 Julien Cristau 2011-08-30 14:33:54 UTC
> --- Comment #9 from Benoit Jacob <bjacob@mozilla.com> 2011-08-30 14:00:11 PDT ---
> The original version used an uninitialized value for the file descriptor it was
> writing to. This version now uses standard output, so one can see what
> information glxtest is sending to Firefox.
> 
Not that it matters much, but stdout is 1, not 0 ;)
Comment 12 Benoit Jacob 2011-08-30 14:36:51 UTC
(In reply to comment #11)
> > --- Comment #9 from Benoit Jacob <bjacob@mozilla.com> 2011-08-30 14:00:11 PDT ---
> > The original version used an uninitialized value for the file descriptor it was
> > writing to. This version now uses standard output, so one can see what
> > information glxtest is sending to Firefox.
> > 
> Not that it matters much, but stdout is 1, not 0 ;)

Oops! You're right. Now I'm totally confused as to why my change actually made it print stuff to stdout.
Comment 13 Benoit Jacob 2011-08-30 14:55:44 UTC
Created attachment 50724 [details]
Stand-alone version of Mozilla's glxtext, v3
Comment 14 Jure Repinc 2011-09-01 05:10:06 UTC
This is what valgrind gives on my laptop with ATI Mobility Radeon 5470:

$ valgrind --tool=memcheck ./glxtest
==20468== Memcheck, a memory error detector
==20468== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==20468== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==20468== Command: ./glxtest
==20468== 
==20468== Invalid read of size 4
==20468==    at 0x52341B2: _mesa_make_extension_string (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x5246397: _mesa_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x5236B01: st_api_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x521F352: dri_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x51FF655: driBindContext (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x3C14056B0E: dri2_bind_context (in /usr/lib64/opengl/xorg-x11/lib/libGL.so.1.2)
==20468==    by 0x3C1403032D: MakeContextCurrent (in /usr/lib64/opengl/xorg-x11/lib/libGL.so.1.2)
==20468==    by 0x400F1C: glxtest() (glxtest.cpp:193)
==20468==    by 0x4010B2: main (glxtest.cpp:239)
==20468==  Address 0x4f157b0 is 0 bytes inside a block of size 1 alloc'd
==20468==    at 0x4A063E4: calloc (vg_replace_malloc.c:467)
==20468==    by 0x5234429: _mesa_make_extension_string (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x5246397: _mesa_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x5236B01: st_api_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x521F352: dri_make_current (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x51FF655: driBindContext (in /usr/lib64/mesa/r600g_dri.so)
==20468==    by 0x3C14056B0E: dri2_bind_context (in /usr/lib64/opengl/xorg-x11/lib/libGL.so.1.2)
==20468==    by 0x3C1403032D: MakeContextCurrent (in /usr/lib64/opengl/xorg-x11/lib/libGL.so.1.2)
==20468==    by 0x400F1C: glxtest() (glxtest.cpp:193)
==20468==    by 0x4010B2: main (glxtest.cpp:239)
==20468== 
VENDOR
X.Org
RENDERER
Gallium 0.4 on AMD CEDAR
VERSION
2.1 Mesa 7.12-devel (git-82fff5f)
TFP
TRUE
==20468== 
==20468== HEAP SUMMARY:
==20468==     in use at exit: 212,008 bytes in 531 blocks
==20468==   total heap usage: 3,105 allocs, 2,574 frees, 2,729,162 bytes allocated
==20468== 
==20468== LEAK SUMMARY:
==20468==    definitely lost: 1,948 bytes in 4 blocks
==20468==    indirectly lost: 0 bytes in 0 blocks
==20468==      possibly lost: 120 bytes in 3 blocks
==20468==    still reachable: 209,940 bytes in 524 blocks
==20468==         suppressed: 0 bytes in 0 blocks
==20468== Rerun with --leak-check=full to see details of leaked memory
==20468== 
==20468== For counts of detected and suppressed errors, rerun with: -v
==20468== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 11 from 9)
Comment 15 Jure Repinc 2011-09-01 05:29:39 UTC
Ups forgot the debugging symbols for mesa:

$ valgrind --tool=memcheck ./glxtest
==28048== Memcheck, a memory error detector
==28048== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==28048== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==28048== Command: ./glxtest
==28048== 
==28048== Invalid read of size 4
==28048==    at 0x54B71B2: _mesa_make_extension_string (extensions.c:912)
==28048==    by 0x54C9397: _mesa_make_current (context.c:1514)
==28048==    by 0x54B9B01: st_api_make_current (st_manager.c:789)
==28048==    by 0x54A2352: dri_make_current (dri_context.c:205)
==28048==    by 0x5482655: driBindContext (dri_util.c:196)
==28048==    by 0x509FB0E: dri2_bind_context (dri2_glx.c:151)
==28048==    by 0x507932D: MakeContextCurrent (glxcurrent.c:269)
==28048==    by 0x400F1C: glxtest() (glxtest.cpp:193)
==28048==    by 0x4010B2: main (glxtest.cpp:239)
==28048==  Address 0x4f157b0 is 0 bytes inside a block of size 1 alloc'd
==28048==    at 0x4A063E4: calloc (vg_replace_malloc.c:467)
==28048==    by 0x54B7429: _mesa_make_extension_string (extensions.c:776)
==28048==    by 0x54C9397: _mesa_make_current (context.c:1514)
==28048==    by 0x54B9B01: st_api_make_current (st_manager.c:789)
==28048==    by 0x54A2352: dri_make_current (dri_context.c:205)
==28048==    by 0x5482655: driBindContext (dri_util.c:196)
==28048==    by 0x509FB0E: dri2_bind_context (dri2_glx.c:151)
==28048==    by 0x507932D: MakeContextCurrent (glxcurrent.c:269)
==28048==    by 0x400F1C: glxtest() (glxtest.cpp:193)
==28048==    by 0x4010B2: main (glxtest.cpp:239)
==28048== 
VENDOR
X.Org
RENDERER
Gallium 0.4 on AMD CEDAR
VERSION
2.1 Mesa 7.12-devel (git-11f6466)
TFP
TRUE
==28048== 
==28048== HEAP SUMMARY:
==28048==     in use at exit: 212,008 bytes in 531 blocks
==28048==   total heap usage: 3,105 allocs, 2,574 frees, 2,729,162 bytes allocated
==28048== 
==28048== LEAK SUMMARY:
==28048==    definitely lost: 1,948 bytes in 4 blocks
==28048==    indirectly lost: 0 bytes in 0 blocks
==28048==      possibly lost: 120 bytes in 3 blocks
==28048==    still reachable: 209,940 bytes in 524 blocks
==28048==         suppressed: 0 bytes in 0 blocks
==28048== Rerun with --leak-check=full to see details of leaked memory
==28048== 
==28048== For counts of detected and suppressed errors, rerun with: -v
==28048== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 11 from 9)
Comment 16 Jure Repinc 2011-09-24 04:43:18 UTC
Is there anything else I can test to help fix this?
Comment 17 Benoit Jacob 2011-09-25 09:18:10 UTC
(In reply to comment #16)
> Is there anything else I can test to help fix this?

Looking at the Valgrind error from comment 15, it suggests that _mesa_make_extension_string (extensions.c:912) makes the assumption that it's safe to read 4 bytes from a string and that assumption fails as the string is empty with just 1 byte allocated (and set to 0) by calloc.

This kind of error is very commonly found in string manipulation and is generally considered safe to ignore as it can hardly lead to a crash. Indeed, AFAIK it could only crash if reading those 4 bytes led to crossing a (4k) page boundary, which would mean that the pointer was not even 4-byte-aligned... It's not inconceivable to have a Firefox-specific crash there as Firefox uses a custom memory allocator (jemalloc).

What you could do to investigate this bug, if you can recompile Mesa, is add a printf in _mesa_make_extension_string (extensions.c:912) to print this pointer. And perhaps add a printf at the next line to see if it's there that we crash.
Comment 18 Benoit Jacob 2011-09-25 09:20:37 UTC
... or just use gdb to do the same. Not sure why I was recommending printf-based debugging ;-) oh yes, it might be a little tricky with gdb as this is in a child process, you might have to break on fork to get the child process PID and then attach gdb to the child process.
Comment 19 Michel Dänzer 2011-09-29 04:07:25 UTC
(In reply to comment #17)
> Looking at the Valgrind error from comment 15, it suggests that
> _mesa_make_extension_string (extensions.c:912) makes the assumption that it's
> safe to read 4 bytes from a string and that assumption fails as the string is
> empty with just 1 byte allocated (and set to 0) by calloc.

Indeed, looks like a bug in gcc/glibc. Note that current Mesa Git works around this by rounding up the allocation sizes to multiples of four bytes.
Comment 20 Benoit Jacob 2011-09-29 06:25:14 UTC
Note, again (see comment 17), that this is an extremely common bug in string functions, and that I'm not at all sure that it's the cause of the crash since it could only lead to a crash if the string were not 4-byte aligned, which would be surprising.
Comment 21 Nicholas Miell 2011-09-30 20:45:37 UTC
It isn't a bug, it is a perfectly valid and entirely safe optimization.

Valgrind supplies its own overrides for the string functions that don't do this optimization specifically to prevent false positives from the C library, but in this case the false positive results from a call to strlen() that gcc has inlined.

Does jemalloc (or Firefox's fork of jemalloc) obey the platform ABI requirements for the alignment of the return values of the malloc functions?

Because gcc's value range propagation has carried the fact that the pointer was returned by calloc to the point where it has inlined strlen(), and knowing that the pointer has to be word aligned, it has omitted the usual byte-at-a-time strlen() preamble and skipped straight to word-at-a-time loads.

jemalloc's INSTALL lists the following option:
--disable-tiny
    Disable tiny (sub-quantum-sized) object support.  Technically it is not
    legal for a malloc implementation to allocate objects with less than
    quantum alignment (8 or 16 bytes, depending on architecture), but in
    practice it never causes any problems if, for example, 4-byte allocations
    are 4-byte-aligned.

If, for example, jemalloc decides to stick a 4 byte allocation in the last 4 bytes of a page on a 64-bit plaform (i.e. word size is 8 bytes), and gcc's value range propagation can follow a pointer from a malloc function to an inlined strlen, things will explode violently.
Comment 22 Nicholas Miell 2011-09-30 21:08:50 UTC
Looking at Firefox's fork of jemalloc, TINY_MIN_2POW is 1, making the smallest allocation size 2 bytes, which is broken.
Comment 23 Benoit Jacob 2011-09-30 21:22:05 UTC
(In reply to comment #22)
> Looking at Firefox's fork of jemalloc, TINY_MIN_2POW is 1, making the smallest
> allocation size 2 bytes, which is broken.

This sounds like something that would be very much worth discussing with the JEMalloc author, Jason Evans, by filing a bug at bugzilla.mozilla.org, product 'Core', component 'jemalloc', CC'ing Jason Evans ... and CC me too please, I'm interested. I'm not really competent to file this myself.
Comment 24 Nicholas Miell 2011-10-01 16:59:16 UTC
Already filed https://bugzilla.mozilla.org/show_bug.cgi?id=691003 yesterday.

Upstream jemalloc is fine (and has diverged rather significantly from Firefox's copy).
Comment 25 Benoit Jacob 2011-10-06 04:54:06 UTC
Thanks a lot.

https://bugzilla.mozilla.org/show_bug.cgi?id=691003 is fixed as of today, will be in tomorrow's Nightly.

It seems that this bug could have explained a lot of mysterious Firefox crashiness on linux. Having it fixed fills me with indescriptible glee.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.