Bug 99781

Summary: Some Unity games fail assertion on startup in glXCreateContextAttribsARB
Product: Mesa Reporter: Nicholas Miell <nmiell>
Component: GLXAssignee: Hal Gentz <zegentzy>
Status: RESOLVED MOVED QA Contact: mesa-dev
Severity: normal    
Priority: medium CC: alexander, irherder, lemody, nightuser, psychon, rstrode, samuel, synaptykq, tele42k3, wilfried.philips, zegentzy
Version: 13.0   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Repo the bug
Improved repo of the bug.
Patch for printing out backtraces for possibly relevant functions.
A single-threaded reproducer
Patch that fixes bug.
Requested patch (does not work)

Description Nicholas Miell 2017-02-12 17:15:26 UTC
Multiple Unity games die during startup due to an assertion failure in glXCreateContextAttribsARB.

The assertion in question is "Unknown sequence number while processing queue" in poll_for_event() in libX11's src/xcb_io.c. I'm blaming Mesa for now because you've got to start somewhere and some of the other potential culprits (Xlib, XCB) also live in this Bugzilla.

Affected games include: Oxenfree, The Fall.

The stack trace looks like:
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007ffff61b051a in __GI_abort () at abort.c:89
#2  0x00007ffff61a6da7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7ffff72b0778 "!xcb_xlib_threads_sequence_lost", file=file@entry=0x7ffff72b05eb "xcb_io.c", 
    line=line@entry=259, function=function@entry=0x7ffff72b0a28 <__PRETTY_FUNCTION__.14787> "poll_for_event") at assert.c:92
#3  0x00007ffff61a6e52 in __GI___assert_fail (assertion=assertion@entry=0x7ffff72b0778 "!xcb_xlib_threads_sequence_lost", file=file@entry=0x7ffff72b05eb "xcb_io.c", line=line@entry=259, 
    function=function@entry=0x7ffff72b0a28 <__PRETTY_FUNCTION__.14787> "poll_for_event") at assert.c:101
#4  0x00007ffff723e32a in poll_for_event (dpy=dpy@entry=0x21ca160) at xcb_io.c:256
#5  0x00007ffff723e3db in poll_for_response (dpy=dpy@entry=0x21ca160) at xcb_io.c:274
#6  0x00007ffff723e6cd in _XEventsQueued (dpy=0x21ca160, mode=<optimized out>) at xcb_io.c:349
#7  0x00007ffff7241515 in _XGetRequest (dpy=0x21ca160, type=<optimized out>, len=4) at XlibInt.c:1707
#8  0x00007ffff724162f in _XSeqSyncFunction (dpy=0x21ca160) at XlibInt.c:202
#9  0x00007ffff7240df3 in _XError (dpy=dpy@entry=0x21ca160, rep=rep@entry=0x7fffffffd3e0) at XlibInt.c:1436
#10 0x00007ffff75591a2 in __glXSendErrorForXcb (dpy=dpy@entry=0x21ca160, err=err@entry=0x304bd30) at glx_error.c:81
#11 0x00007ffff7555284 in glXCreateContextAttribsARB (dpy=0x21ca160, config=0x22b3a20, share_context=0x220c3c0, direct=<optimized out>, attrib_list=0x7fffffffd4a0) at create_context.c:119
#12 0x0000000000eac13f in ?? ()
#13 0x0000000000eace22 in ?? ()
#14 0x00000000004654a4 in ?? ()
#15 0x00007ffff6199401 in __libc_start_main (main=0x4641e0, argc=1, argv=0x7fffffffdde8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffddd8)
    at ../csu/libc-start.c:289

This appears to be timing-dependent or a race condition because the games will sometimes successfully start (e.g. the first time run after booting or after dropping the entire VFS cache), and will definitely start if you've set a breakpoint somewhere that triggers at least once per call to glXCreateContextAttribsARB.

As best as I can tell (by setting 4000 breakpoints in gdb using rbreak), Unity does correctly call XInitThreads() before using Xlib or GLX or creating any threads and only ever makes Xlib & GLX calls on thread 1 anyway. Which suggests the race is between the game client and the X server, assuming it is a race. This makes me wonder about the 32/64 sequence number widening in Xlib/XCB which I do not understand at all. When the assertion fails, event_sequence is something like 4294967378 (i.e. 2^32 + 82) while request is 84.

I've written a little LD_PRELOAD shim that calls usleep() in glXCreateContextAttribsARB which makes the games completely playable.
Comment 1 Ray Strode [halfline] 2017-06-30 14:28:47 UTC
can you post an strace -f of the problem happening?

I guess if it's gotten to the assertion things have already gone pair shaped because that's happening when trying to send the error.

Does the problem go away if you boot with maxcpus=1 ?
Comment 2 Ray Strode [halfline] 2017-06-30 15:43:43 UTC
(In reply to Ray Strode [halfline] from comment #1) 
> I guess if it's gotten to the assertion things have already gone pair shaped
> because that's happening when trying to send the error.
So this isn't right I guess.  The game is doing a series of glXCreateContextAttribsARB () calls to figure out what GL level / features are available.  Failure is expected, but blown assertions aren't.

widen() is this:

static void widen(uint64_t *wide, unsigned int narrow)•              
{•                                                                   
→       uint64_t new = (*wide & ~((uint64_t)0xFFFFFFFFUL)) | narrow;•
→       *wide = new + (((uint64_t)(new < *wide)) << 32);•            
}•                                                                   

which is rather confusing but after staring at it for a bit i'll try to deconstruct it:

The first parameter, wide, is an in-out variable.  It takes a 64-bit sequence number as a template, gotten from X_DPY_GET_LAST_REQUEST_READ() and then returns the output of the function back in the same variable.  The second parameter, narrow, is a 32-bit sequence number.  X events only have room for 32-bit sequence numbers associated with them at the xlib level, so narrow is that 32-bit input.  The goal is to recover the lost upper bits, treat that 32-bit sequence number as an offset into a 64-bit number.  We need to find the upper base bits.  Those, are gleaned from the template passed in.

So the first line of the function extracts the upper 32-bits from the template and applies them to the 32-bit number.  Now the 32-bit number has been given its 64 bits back.  But there is a small possibility the 64-bit template and the event don't actually share the same 32-bits.  new should always be greater than the template when widened, because the template is X_DPY_GET_LAST_REQUEST_READ () which is the last request we knew the X server read.  The last request we knew the X server read should be smaller, because this new event is going to be the new last request read; we're in a function that updates the last request read counter known sequence number, and it's happening later in the function.  If the last request previously accounted for is ending up bigger than this new event then it means the upper 32-bits of the sequence number have after actually incremented since then, and the template upper 32-bits are off by one. So the second line is compensating for that.
Comment 3 Ray Strode [halfline] 2017-06-30 15:49:41 UTC
In your case the last request accounted for is 84 and the narrowed error event sequence number is 82.

The code notices 82 is less than 84 and assumes the upper bits were off by one, but what's really happened (I guess) is somehow the last request accounted for is getting updated early. maybe events are getting handled out of order
Comment 4 Ray Strode [halfline] 2017-06-30 17:58:19 UTC
so looking a few stack frames up I see this:

   err = xcb_request_check(c, cookie);•  
   if (err != NULL) {•     
      gc->vtable->destroy(gc);•
      gc = NULL;•  
•                                           
      __glXSendErrorForXcb(dpy, err);•
      free(err);•      
   }•                           

and then __glXSendErrorForXcb does this:

void•
__glXSendErrorForXcb(Display * dpy, const xcb_generic_error_t *err)•
{•
   xError error;•
…
   error.sequenceNumber = err->sequence;•
…
   _XError(dpy, &error);•
…
}•

So we see the sequence number comes from before a destroy hook which calls into the driver.  I guess it's possible (maybe?) that the hook is generating protocol that updates last_request_read.

You could try reordering the gc->vtable->destroy(gc); after the glXSendErrorForXcb and see if that fixes it. 

If that doesn't work you could set a hardware watchpoint on (((struct _XDisplay*)dpy)->last_request_read that just prints a backtrace and continues, so you could see what's updating the last_request_read out from under it.  If that changes timing too much, you could put backtrace() into arrays in X_DPY_SET_LAST_REQUEST_READ() and backtrace_symbols from gdb when it tanks.
Comment 5 Nicholas Miell 2018-02-07 19:08:53 UTC
*** Bug 105004 has been marked as a duplicate of this bug. ***
Comment 6 Samuel Sieb 2018-12-30 04:51:21 UTC
I can verify that limiting the program to one core allows it to start up.
Comment 7 Samuel Sieb 2018-12-30 05:03:31 UTC
Of course, as soon as I write that and try again, it crashes every time even limited to one cpu.  I must have just been lucky that one time or something.
Comment 8 Hal Gentz 2019-03-24 03:49:23 UTC
Created attachment 143763 [details]
Repo the bug

Some of the users of my project, glutin, stumbled upon this bug, https://github.com/tomaka/glutin/issues/1034

While I don't yet have a solution, I got a small C program this (for my computer at least) always reproduces the issue.

Run via `cc -o gl3 main.c -lGL -lX11 -lpthread -g && ./gl3`.

```
$ pacman -Q | grep -i "mesa\|llvm\|clang\|glvnd\|vulkan\|amdgpu\|radeon\|libdrm\|amdvlk\|xorg\|x11\|xcb\|glx"
amdvlk-git r40.48d46fa-1
clang-svn 356068-1
clang-tools-extra-svn 356068-1
lib32-amdvlk-git r40.48d46fa-1
lib32-libdrm-git 6503.4735ca71-1
lib32-libglvnd 1.1.1-1
lib32-libx11 1.6.7-1
lib32-libxcb 1.13.1-1
lib32-llvm-libs-svn 356086-1
lib32-llvm-svn 356086-1
lib32-mesa-git 109188.03a0801bcb1-1
lib32-vulkan-icd-loader 1.1.101-1
lib32-vulkan-radeon-git 109188.03a0801bcb1-1
libdrm-git 6503.4735ca71-1
libdrm-git-debug 6503.4735ca71-1
libglvnd-git 1.1.0.r13.gf92208b-1
libx11 1.6.7-1
libxcb 1.13.1-1
libxkbcommon-x11 0.8.4-1
llvm-libs-svn 356068-1
llvm-ocaml-svn 356068-1
llvm-svn 356068-1
mesa-demos 8.4.0-1
mesa-git 109188.03a0801bcb1-1
mesa-git-debug 109188.03a0801bcb1-1
opencl-mesa-git 109188.03a0801bcb1-1
qt5-x11extras 5.12.2-1
vulkan-headers 1:1.1.102-2
vulkan-icd-loader 1.1.101-1
vulkan-radeon-git 109188.03a0801bcb1-1
vulkan-tools 1.1.101-1
vulkan-validation-layers 1.1.100-2
xcb-proto 1.13-2
xcb-util 0.4.0-2
xcb-util-cursor 0.1.3-2
xcb-util-image 0.4.0-2
xcb-util-keysyms 0.4.0-2
xcb-util-renderutil 0.3.9-2
xcb-util-wm 0.4.1-2
xcb-util-xrm 1.3-1
xf86-video-amdgpu-git 498.6ee8577-1
xf86-video-amdgpu-git-debug 498.6ee8577-1
xorg-bdftopcf 1.1-1
xorg-docs 1.7.1-2
xorg-font-util 1.3.1-2
xorg-font-utils 7.6-5
xorg-fonts-100dpi 1.0.3-4
xorg-fonts-75dpi 1.0.3-4
xorg-fonts-alias 1.0.3-2
xorg-fonts-encodings 1.0.4-5
xorg-iceauth 1.0.8-1
xorg-luit 1.1.1-3
xorg-mkfontscale 1.2.0-2
xorg-server 1.20.4-1
xorg-server-common 1.20.4-1
xorg-server-devel 1.20.4-1
xorg-server-xdmx 1.20.4-1
xorg-server-xephyr 1.20.4-1
xorg-server-xnest 1.20.4-1
xorg-server-xvfb 1.20.4-1
xorg-server-xwayland 1.20.4-1
xorg-sessreg 1.1.1-2
xorg-setxkbmap 1.3.1-2
xorg-smproxy 1.0.6-2
xorg-util-macros 1.19.2-1
xorg-x11perf 1.6.0-2
xorg-xauth 1.0.10-2
xorg-xbacklight 1.2.2-1
xorg-xclock 1.0.8-1
xorg-xcmsdb 1.0.5-2
xorg-xcursorgen 1.0.7-1
xorg-xdpyinfo 1.3.2-2
xorg-xdriinfo 1.0.6-1
xorg-xev 1.2.3-1
xorg-xfontsel 1.0.6-1
xorg-xgamma 1.0.6-2
xorg-xhost 1.0.8-1
xorg-xinit 1.4.1-1
xorg-xinput 1.6.2-2
xorg-xkbcomp 1.4.2-1
xorg-xkbevd 1.1.4-2
xorg-xkbutils 1.0.4-3
xorg-xkill 1.0.5-1
xorg-xlsatoms 1.1.3-1
xorg-xlsclients 1.1.4-1
xorg-xmessage 1.0.5-1
xorg-xmodmap 1.0.10-1
xorg-xpr 1.0.5-1
xorg-xprop 1.2.4-1
xorg-xrandr 1.5.0-2
xorg-xrdb 1.2.0-1
xorg-xrefresh 1.0.6-1
xorg-xset 1.2.4-1
xorg-xsetroot 1.1.2-1
xorg-xvinfo 1.1.4-1
xorg-xwd 1.0.7-1
xorg-xwininfo 1.1.4-1
xorg-xwud 1.0.5-1
xorgproto 2018.4-1
```
Comment 9 Hal Gentz 2019-03-24 03:52:09 UTC
s/this/that/

Also, with foresight, I would have pastebinned pacman's output. That was longer than I thought it would look.
Comment 10 Hal Gentz 2019-03-24 07:32:05 UTC
Created attachment 143764 [details]
Improved repo of the bug.
Comment 11 Hal Gentz 2019-03-24 07:33:19 UTC
Created attachment 143765 [details] [review]
Patch for printing out backtraces for possibly relevant functions.
Comment 12 Uli Schlachter 2019-03-24 10:03:42 UTC
I took a look at this and I would guess that the bug is in mesa. However, I do not want to built mesa myself and thus I will not test my theory.

The function __glXSendError() in mesa's src/glx/glx_error.c invents an X11 protocol error out of thin air. For the sequence number it uses dpy->request. This is the sequence number of the last request that was sent. _XError() will then update dpy->last_request_read based on the sequence number of the error that just "came in".

If now another something comes in with a sequence number less than dpy->last_request_read, since sequence numbers are monotonically increasing, widen() will incorrectly add 1<<32 to the sequence number and things might go downhill afterwards.

At least that's my theory. If my theory is right, changing mesa's __glXSendError() to use dpy->last_request_read instead of dpy->request should fix things.
Comment 13 Uli Schlachter 2019-03-24 16:41:44 UTC
Created attachment 143766 [details]
A single-threaded reproducer

I think I came up with a single-threaded program reproducing the bug. More details in inline comments.

When changing the copy of __glXSendError() from using dpy->request to dpy->last_request_read (and compiling with -DNDEBUG so that the inline asserts() do not trigger), the problem goes away. So, I think the analysis from my previous comment is correct.
Comment 14 Hal Gentz 2019-03-24 21:25:26 UTC
Created attachment 143767 [details] [review]
Patch that fixes bug.

Hey,

This patch fixes this bug. I had to also patch `__glXSendErrorForXcb` because `glXCreateContextAttribsARB` actually uses the former.
Comment 16 Ian Romanick 2019-04-01 21:57:54 UTC
(In reply to Hal Gentz from comment #10)
> Created attachment 143764 [details]
> Improved repo of the bug.

I would very much like to see this as a piglit test.
Comment 17 tele42k3 2019-04-12 13:06:03 UTC
I happened across this issue with a Unity 5.3.8 game and Mesa 18.3.6 on a Sandybridge / xf86-video-intel laptop. The patch linked in comment #15 fixed it for me.

Please consider CC'ing mesa-stable with the fix.
Comment 18 Michel Dänzer 2019-04-30 07:50:34 UTC
Thanks for the report, fixed in Mesa Git master:

https://gitlab.freedesktop.org/mesa/mesa/commit/e91ee763c378d03883eb88cf0eadd8aa916f7878
Comment 19 Timothy Arceri 2019-05-08 03:21:42 UTC
I've reverted this for now as it was causing regressions bug #110632 and bug #110590

commit a01b393c397c846345f03f76f1167dd667e0ee96
Author: Timothy Arceri <tarceri@itsqueeze.com>
Date:   Tue May 7 13:55:32 2019 +1000

    Revert "glx: Fix synthetic error generation in __glXSendError"
    
    This reverts commit e91ee763c378d03883eb88cf0eadd8aa916f7878.
    
    This seems to have broken a number of wine games. Lets revert
    everything for now and try again later.
    
    Acked-by: Adam Jackson <ajax@redhat.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110632
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110590

Looks like we need better piglit tests before trying again.
Comment 20 Uli Schlachter 2019-05-08 16:01:12 UTC
Random guess for where the regression comes from:

X11DRV_expect_error() is used to say "I expect that the next request might fail":
https://github.com/wine-mirror/wine/blob/6d801377055911d914226a3c6af8d8637a63fa13/dlls/winex11.drv/x11drv_main.c#L228-L241

...which is then used by the error handler to check if the error is expected and should be ignored:
https://github.com/wine-mirror/wine/blob/6d801377055911d914226a3c6af8d8637a63fa13/dlls/winex11.drv/x11drv_main.c#L262-L271

So... apparently Wine wants to catch errors for "the next X11 request" while mesa tries to invent errors that do not come from any X11 request.

Another random idea for a fix would be: Add a call to XSync(dpy, False) to the "invent an error"-functions. That should guarantee that dpy->request == dpy->last_request_read and so it does not matter any more which of the two sequence numbers mesa uses.

(Well, actually XSync() does a XGetInputFocus() internally, so mesa would then use the sequence number of this GetInputFocus request for its claim that something failed that never happened.)
Comment 21 Hal Gentz 2019-06-21 23:40:30 UTC
I just spotted that this got reopened today when going through my emails for the first time in forever... this is unfortunate.

I'll take another swing at this sometime next week using Uli Schlachter's new proposed method. Hopefully that doesn't cause a regression.
Comment 22 Hal Gentz 2019-06-30 06:50:51 UTC
Created attachment 144685 [details] [review]
Requested patch (does not work)

I tried implementing your suggested patch, Uli, but it did not work. 

XSync also tried to call LockDisplay, which for some reason blocks with our call to LockDisplay. This behavior, I think, contradicts what the XLockDisplay docs say: 

>  Nested calls to XLockDisplay work correctly; the display will not actually be unlocked until XUnlockDisplay has been called the same number of times as XLockDisplay.

This behavior really suprized me, so I'm not sure what's up. My only suspician is that the mutex not being made with a call to pthread_mutex_init w/ the PTHREAD_MUTEX_RECURSIVE attribute. I tried grepping for `pthread_mutex_init`, `PTHREAD_MUTEX_RECURSIVE`, `pthread` and `RECURSIVE` in the libx11 codebase to no avail. The only file I found was `UIThrStubs.c`, which if I understand correctly, is only used when multi-threading support is disabled at compile time.

I found a call to some `xmutex_init` function in locking.c, however, this function only receives one parameter, unlike the two required by `pthread_mutex_init`. gdb shows (I had to use `si` instead of `s`) that calls to `xmutex_init` actually result in `pthread_mutex_init` getting called, however, I'm not sure if that's because `xmutex_init` is some clever macro, or if `-O3` inlined it. Nevertheless, I could not find any definition for `xmutex_init` beyond the one in `UIThrStubs.c`.


I think we could possibly move the call to XSync after the call to UnlockDisplay, however, I'm concerned that this might lead to some sort of race condition if some other thread locking the display between our call to UnlockDisplay and XSync's call to LockDisplay.
Comment 23 Hal Gentz 2019-06-30 06:54:04 UTC
s/blocks with/blocks from/
s/suspician/suspicion/
s/the mutex/the mutex is/

Apologies in advance for inconsistent back ticking and not spell checking. I really wish I could edit replies.
Comment 24 Hal Gentz 2019-06-30 06:56:24 UTC
s/if some other thread locking/if some other thread locks/

Sorry, again.
Comment 25 GitLab Migration User 2019-09-18 17:45:38 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/108.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.