Bug 57372 - x11-libs/libxcb media-libs/mesa segfault in __glXGetString
Status: RESOLVED WORKSFORME
Alias: None
Product: Mesa
Classification: Unclassified
Component: GLX
Version: 9.0
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: mesa-dev
 
Reported: 2012-11-21 14:38 UTC by Richard Freeman
Modified: 2014-01-22 02:32 UTC

Attachments
backtrace (22.56 KB, text/plain)
2012-11-21 14:38 UTC, Richard Freeman
backtrace -O0 (22.53 KB, text/plain)
2012-11-21 20:13 UTC, Richard Freeman
Output of glxinfo (25.28 KB, text/plain)
2012-11-25 02:22 UTC, Richard Freeman

Description Richard Freeman 2012-11-21 14:38:47 UTC
Created attachment 70371
backtrace

Downstream bug:
https://bugs.gentoo.org/show_bug.cgi?id=444159

I'm getting a segfault in libxcb, which seems to be the result of calling xcb_glx_get_string_string_length with a null parameter in __glXGetString.  The call into mesa originates in qt-opengl, called from the application sleepyhead.

Full backtrace attached - happy to generate additional info as required.

I couldn't find documentation concerning error handling in these functions, so I'm not sure which point in the call chain is considered at fault for passing along bad input.  I did note that __glXGetString does not check whether reply is NULL before passing it along, which might or might not be intended.
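
To make the missing check concrete, here is a minimal sketch of a defensive copy routine. It is not the Mesa code: `struct string_reply` and `copy_reply_string` are hypothetical stand-ins for the reply object and the code that reads it, written only against the C standard library so the pattern can be tried in isolation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reply object; not the real xcb type. */
struct string_reply {
    uint32_t length;   /* number of payload bytes */
    const char *data;  /* payload, not assumed to be NUL-terminated */
};

/*
 * Copy the payload into a freshly allocated, NUL-terminated buffer.
 * Returns NULL when there is no reply or when allocation fails.
 * The caller owns the returned buffer and must free() it.
 */
char *copy_reply_string(const struct string_reply *reply)
{
    if (reply == NULL)
        return NULL;  /* the request failed: nothing to dereference */

    char *buf = malloc((size_t)reply->length + 1);
    if (buf == NULL)
        return NULL;

    if (reply->length > 0) {
        assert(reply->data != NULL);  /* a non-empty reply must carry data */
        memcpy(buf, reply->data, reply->length);
    }
    buf[reply->length] = '\0';
    return buf;
}
```

With a check like this, a failed request would surface to the caller as a NULL string rather than as a crash inside the length accessor. Whether returning NULL is the right behaviour for __glXGetString is a separate question that this sketch does not settle.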
Comment 1 Richard Freeman 2012-11-21 20:13:40 UTC
Created attachment 70391
backtrace -O0
Comment 2 Richard Freeman 2012-11-22 03:19:52 UTC
I've been doing some debugging; some observations:

1.  xcb_glx_get_string is returning a cookie.
2.  xcb_connection_has_error is 0 both before and after xcb_glx_get_string_reply is called.

I'm slowly teaching myself far more xcb than I ever expected to learn, but if there is anything I can do to capture more useful info let me know - I can patch any libraries as needed as long as it isn't disruptive to the rest of the system.
Comment 3 Richard Freeman 2012-11-22 14:29:08 UTC
One other note - I'm running xorg-server 1.13.0.  That was upgraded at the same time as xcb/mesa, so the impactful change could be on the server side.  From reviewing git it is apparent that none of the xcb/mesa code involved has changed recently (well, as of the versions I'm using, which are libxcb 1.9, mesa 9.0, and xcb-proto 1.8).  Might be some kind of race condition where a change in the server side triggered a failure to handle some case on the client side.
Comment 4 Richard Freeman 2012-11-25 02:21:41 UTC
After further testing I discovered that this segfault does not occur if I use x2goserver as my x11 server.  This is on the same host with the same versions of libxcb/mesa/qt/etc.  

I suspect that this means that either the problem lies in driver-specific mesa code (which would obviously be on the server side), or with the interaction with xorg-server.

I am on ATI hardware.  I'll upload my glxinfo.
Comment 5 Richard Freeman 2012-11-25 02:22:12 UTC
Created attachment 70533
Output of glxinfo
Comment 6 Richard Freeman 2012-11-28 01:37:20 UTC
I've got some more data points here.  While the segfault occurs client-side, it appears to be triggered by a change server-side.

I can reproduce the problem if the X11 server is xorg-server 1.13.0-r1 (Gentoo version number).  The problem does not occur if the X11 server is 1.12.4.  The behavior is the same if either mesa 8.0.4-r1 or mesa 9.0 is installed on either the server or the client side.  The behavior is reproducible if the client is running on a separate host from the X11 server, if the X11 server is running 1.13.0.

Note also that the problem does not occur with xorg-server 1.12.4 even with the same versions of xf86-input-evdev/keyboard/mouse and xf86-video-ati as were running in 1.13 (though they do need to be recompiled against whatever server version is in use).  So, I can make the problem appear or go away by only changing the version of xorg-server, even when using the older server with the newer drivers.  

I'm tempted to break out the commit log and start bisecting it next, but if anybody has any pointers I'm all ears...
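
For reference, the search that bisecting a commit log performs can be sketched as a binary search over commit indices. This is an illustration only: `first_bad_commit`, `is_bad_fn` and `bad_from_threshold` are hypothetical names, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true if the build at commit `index` segfaults. */
typedef bool (*is_bad_fn)(size_t index, void *ctx);

/*
 * Return the index of the first bad commit, given that `good` is known
 * good and `bad` is known bad. Assumes a linear history where every
 * commit after the first bad one is also bad.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad, void *ctx)
{
    assert(good < bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* the culprit is at mid or earlier */
        else
            good = mid;  /* the culprit is after mid */
    }
    return bad;
}

/* Example predicate for illustration: commits at or after *ctx are bad. */
bool bad_from_threshold(size_t index, void *ctx)
{
    const size_t *threshold = ctx;
    return index >= *threshold;
}
```

Each step halves the remaining range, so about 100 commits need roughly seven builds. The search only finds the earliest bad commit; if a later commit independently causes the same failure, it has to be found separately.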
Comment 7 Richard Freeman 2012-11-28 01:59:17 UTC
One other data point - the problem does not occur for xorg-server 1.13.0 if it is running in virtualbox.  So either the virtualbox video driver or the fact that the server is running in virtualbox eliminates the problem.
Comment 8 Richard Freeman 2012-11-28 21:33:06 UTC
I've determined that the segfault does not occur with xserver commit 90aa2486e394c0344aceb2a70432761665a79333, and it does occur with xserver commit ed6daa15a7dcf8dba930f67401f4c1c8ca2e6fac.

So, that appears to be the commit that introduces this behavior:
commit ed6daa15a7dcf8dba930f67401f4c1c8ca2e6fac
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Wed Jul 4 15:21:09 2012 -0700

    glx/dri2: Enable GLX_ARB_create_context_robustness

    If the driver supports __DRI2_ROBUSTNESS, then enable
    GLX_ARB_create_cotnext_robustness as well.  If robustness values are
    passed to glXCreateContextAttribsARB and the driver doesn't support
    __DRI2_ROBUSTNESS, existing drivers will already generate the correct
    error values (so that the correct GLX errors are generated).

    Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Dave Airlie <airlied@redhat.com>
    Signed-off-by: Keith Packard <keithp@keithp.com>

I'll see if I can get xorg-1.13.0 patched to be missing just this commit.
Comment 9 Richard Freeman 2012-11-29 03:54:31 UTC
Ok, it turns out that I can eliminate this issue in xserver 1.13.0 if I revert two commits:
ed6daa15a7dcf8dba930f67401f4c1c8ca2e6fac (mentioned above)
bcbf95b1bafa6ffe724768b9309295e2fdb4b860
Author: Jon TURNEY <jon.turney@dronecode.org.uk>
Date:   Thu Jul 12 00:36:10 2012 +0100

    Revert bogus GlxPushProvider() in commit a1d41e3
    
    a1d41e3 "Move extension initialisation prototypes into extinit.h"
    also includes a change to GlxExtensionInit to install the swrast GLX
    provider.
    
    Since b86aa74 "GLX: Insert swrast provider from GlxExtensionInit"
    already does this (correctly, by installing the swrast provider
    at the end of the chain, rather than at the beginning), and since this
    would seem to have the effect of making the swrast provider the most
    preferred provider, I'm guessing this wasn't intended.
    
    Signed-off-by: Jon TURNEY <jon.turney@dronecode.org.uk>
    Reviewed-by: Daniel Stone <daniel@fooishbar.org>
    Reviewed-by: Colin Harrison <colin.harrison@virgin.net>

Parts of that second commit appear to have been reverted along the way already.  I suspect that the GlxExtensionInit change is the part of the second commit that is relevant.

So, I haven't had any time to try to decipher what is going on - I suspect this will be more obvious to those who introduced the code in the first place.  However, taking a look at this is my next step if nobody gets to it first.  

Appropriate action might be to either change xserver or mesa depending on which is actually wrong.
Comment 10 Jon Turney 2012-11-29 13:31:24 UTC
(In reply to comment #9)
> Ok, it turns out that I can eliminate this issue in xserver 1.13.0 if I
> revert two commits:
> ed6daa15a7dcf8dba930f67401f4c1c8ca2e6fac (mentioned above)
> bcbf95b1bafa6ffe724768b9309295e2fdb4b860

Not sure if you are saying both or either of these need to be reverted.

> Author: Jon TURNEY <jon.turney@dronecode.org.uk>
> Date:   Thu Jul 12 00:36:10 2012 +0100
> 
>     Revert bogus GlxPushProvider() in commit a1d41e3
>     
>     a1d41e3 "Move extension initialisation prototypes into extinit.h"
>     also includes a change to GlxExtensionInit to install the swrast GLX
>     provider.
>     
>     Since b86aa74 "GLX: Insert swrast provider from GlxExtensionInit"
>     already does this (correctly, by installing the swrast provider
>     at the end of the chain, rather than at the beginning), and since this
>     would seem to have the effect of making the swrast provider the most
>     preferred provider, I'm guessing this wasn't intended.

I think this may be a bit of a red herring.  I would guess with this commit
reverted, the X server ends up using swrast, which perhaps masks the problem
you are seeing.
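
To illustrate why the insertion point matters, here is a small model of a provider chain in which the first provider that accepts the screen wins. It is a sketch, not the xserver implementation: `struct provider`, `push_front`, `push_back` and `select_provider` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a GLX provider; not the xserver's own struct. */
struct provider {
    const char *name;
    bool can_probe;         /* whether this provider accepts the screen */
    struct provider *next;
};

/* Insert at the head of the chain: this provider is tried first. */
void push_front(struct provider **head, struct provider *p)
{
    assert(p != NULL);
    p->next = *head;
    *head = p;
}

/* Insert at the tail: this provider is tried only if all others decline. */
void push_back(struct provider **head, struct provider *p)
{
    assert(p != NULL);
    p->next = NULL;
    while (*head != NULL)
        head = &(*head)->next;
    *head = p;
}

/* Return the first provider in the chain that accepts the screen. */
const struct provider *select_provider(const struct provider *head)
{
    for (; head != NULL; head = head->next)
        if (head->can_probe)
            return head;
    return NULL;
}
```

In this model a software provider appended with `push_back` is only a fallback, while the same provider added with `push_front` is selected ahead of a hardware provider that would also have accepted the screen.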
Comment 11 Richard Freeman 2012-11-29 13:59:53 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > Ok, it turns out that I can eliminate this issue in xserver 1.13.0 if I
> > revert two commits:
> > ed6daa15a7dcf8dba930f67401f4c1c8ca2e6fac (mentioned above)
> > bcbf95b1bafa6ffe724768b9309295e2fdb4b860
> 
> > Not sure if you are saying both or either of these need to be reverted.

Both of those need to be reverted to prevent the segfault in 1.13.0; the segfault occurs if either commit is still present.  When I reverted only the first one, 1.13.0 still segfaulted, although git versions prior to the second commit did not.  If I build 1.13.0 without both of those commits, there is no error.

> I think this may be a bit of a red herring.  I would guess with this commit
> reverted, the X server ends up using swrast, which perhaps masks the problem
> you are seeing.

Entirely possible, which is probably why the virtualbox drivers work as well.  However, unless 1.12.4 was also stuck in swrast it seems like there is something else going on.

Note that I'm not suggesting that YOU should actually revert those commits or anything - just that I don't get the segfault without them.  

I guess I could try leaving xserver alone and start messing with the ati drivers instead (stock 1.13.0 with earlier builds of the drivers).
Comment 12 Ian Romanick 2014-01-22 02:05:59 UTC
Does this problem still occur?
Comment 13 Richard Freeman 2014-01-22 02:29:26 UTC
(In reply to comment #12)
> Does this problem still occur?

I haven't seen this in a long time, but that is just as likely because the software that was generating the errors made some substantial Qt/OpenGL changes.

If you want to close this feel free, and if it comes up again I'll happily resubmit with updated info (I'm well past xorg 1.13 at this point).

