Bug 84252 - DRI3 rendering gets stuck waiting for present idle notify event
Summary: DRI3 rendering gets stuck waiting for present idle notify event
Status: RESOLVED FIXED
Alias: None
Product: XCB
Classification: Unclassified
Component: Library (show other bugs)
Version: 1.11
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: xcb mailing list dummy
QA Contact: xcb mailing list dummy
URL:
Whiteboard:
Keywords:
: 84926 (view as bug list)
Depends on:
Blocks:
 
Reported: 2014-09-23 16:36 UTC by Tobias Jakobi
Modified: 2016-02-15 04:24 UTC (History)
15 users (show)

See Also:
i915 platform:
i915 features:


Attachments
xtrace output (18.93 KB, text/plain)
2014-09-24 17:30 UTC, Tobias Jakobi
Details
Patch proposition (9.81 KB, text/plain)
2015-04-17 17:36 UTC, Axel Davy
Details
Second Patch Proposition (10.35 KB, patch)
2015-04-19 19:48 UTC, Axel Davy
Details | Splinter Review
Test that triggers the hang (8.96 KB, application/octet-stream)
2015-05-16 09:12 UTC, Axel Davy
Details
XCB special event wakeup fixes (839 bytes, patch)
2015-05-19 02:14 UTC, Michel Dänzer
Details | Splinter Review
Another test that runs into a different, but quite similar hang (2.91 KB, text/plain)
2015-06-12 13:12 UTC, Uli Schlachter
Details
Make _xcb_in_wake_up_next_reader() wake up threads in xcb_wait_for_special_event() (4.35 KB, patch)
2015-06-12 13:24 UTC, Uli Schlachter
Details | Splinter Review
attachment-29921-0.html (799 bytes, text/html)
2016-01-25 21:16 UTC, Tim Writer
Details

Description Tobias Jakobi 2014-09-23 16:36:39 UTC
Hello,

this is a cleaned up version of (the now closed) bug #81548.

Software stack:
xorg-server git tip
mesa git tip
libdrm git tip
libxcb git tip
vanilla kernel 3.16.3

Primary GPU is a hsw/gen7.5 (i7-4700HQ), the secondary GPU is a R9 M265X. DRI3/PRIME is used.

Issue: My test application is the x86 variant of HL2 Ep1. After loading a specific savegame I can reliably get the game to "hang". Attaching gdb yields the backtrace given below.

Affected configurations: DRI3 with SNA backend enabled in the DDX.
Switching to the UXA backend removes the issue, which (maybe?) points at synchronisation issues that are discussed in bug #81551. The patch proposed in the bugreport is applied, but doesn't cure the issue.

Backtrace:
#0  0xf771cd10 in __kernel_vsyscall ()
(gdb) bt
#0  0xf771cd10 in __kernel_vsyscall ()
#1  0xf7348c8c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xf75e55cc in pthread_cond_wait () from /lib/libc.so.6
#3  0xf70b751f in ?? () from /usr/lib/libxcb.so.1
#4  0xf70c13a5 in xcb_wait_for_special_event () from /usr/lib/libxcb.so.1
#5  0xf7498b49 in dri3_find_back (c=c@entry=0xa410000, priv=priv@entry=0x9fe23f0) at dri3_glx.c:1191
#6  0xf749932b in dri3_get_buffer (format=4099, buffer_type=buffer_type@entry=dri3_buffer_back, loaderPrivate=loaderPrivate@entry=0x9fe23f0, 
    driDrawable=<optimized out>) at dri3_glx.c:1217
#7  0xf7499fd0 in dri3_get_buffers (driDrawable=0xa455110, format=4099, stamp=0x9e706e0, loaderPrivate=0x9fe23f0, buffer_mask=<optimized out>, 
    buffers=0xb8ce6bec) at dri3_glx.c:1394
#8  0xf43a4cf3 in dri_image_drawable_get_buffers (statts_count=<optimized out>, statts=<optimized out>, images=<optimized out>, drawable=<optimized out>)
    at dri2.c:254
#9  dri2_allocate_textures (ctx=0xa455590, drawable=0x9e706e0, statts=0x9ff2b80, statts_count=2) at dri2.c:377
#10 0xf43a199c in dri_st_framebuffer_validate (stctx=0xa579000, stfbi=0x9e706e0, statts=0x9ff2b80, count=2, out=0xb8ce6cc0) at dri_drawable.c:83
#11 0xf42d0378 in st_framebuffer_validate (stfb=stfb@entry=0x9ff2800, st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_manager.c:200
#12 0xf42d1bb7 in st_manager_validate_framebuffers (st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_manager.c:862
#13 0xf4292bf6 in st_validate_state (st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_atom.c:180
#14 0xf429a6e5 in st_BlitFramebuffer (ctx=0x9ef0000, srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, 
    filter=9728) at ../../src/mesa/state_tracker/st_cb_blit.c:94
#15 0xf41358c3 in _mesa_BlitFramebuffer (srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728)
    at ../../src/mesa/main/blit.c:509
#16 0xf5831e13 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#17 0xf5832617 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#18 0xf5823c3b in IDirect3DDevice9::Present(_RECT const*, _RECT const*, void*, RGNDATA const*) ()
   from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#19 0xecd6646d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/shaderapidx9.so
#20 0xf14edfde in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#21 0xf14ee31d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#22 0xf14ce644 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#23 0xf14d684a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#24 0xf14d6697 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#25 0xf5d0b49a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libvstdlib.so
#26 0xf5df29e0 in CThread::ThreadProc(void*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtier0.so
#27 0xf7344f45 in start_thread () from /lib/libpthread.so.0
#28 0xf75d88ee in clone () from /lib/libc.so.6

With best wishes,
Tobias
Comment 1 Chris Wilson 2014-09-23 19:13:20 UTC
Compile with the ddx --enable-debug=full and prepare to be amused.
Comment 2 Tobias Jakobi 2014-09-23 19:48:28 UTC
Done, here's the log:
http://www.math.uni-bielefeld.de/~tjakobi/archive/Xorg.0.log.xz
Comment 3 Chris Wilson 2014-09-23 20:21:27 UTC
Every single present flip, unflip and vblank wait is completed and notified, all appear to be done so on the next vblank. Looks like present/mesa didn't coordinate their fence usage correctly.
Comment 4 Chris Wilson 2014-09-23 20:30:28 UTC
Please try and capture the X command stream e.g. using xtrace. The other suspicion is that there was an error sent which the application didn't notice as it is waiting upon the special event.
Comment 5 Chris Wilson 2014-09-23 20:40:40 UTC
In the debug log, there is one oddity: mesa reuses an active pixmap.



[ 90578.146] sna_dri3_pixmap_from_fd(fd=48, width=1000, height=678, stride=4096, depth=24, bpp=32)
[ 90578.146] kgem_create_for_prime(name=48)
[ 90578.146] kgem_create_for_prime: estimated size=2777088, actual=2777088
[ 90578.146] kgem_create_for_prime: new handle=22, tiling=0
[ 90578.146] __kgem_bo_init(handle=22, num_pages=678)
[ 90578.146] kgem_create_for_prime: imported handle=22 has caching 0
[ 90578.146] kgem_create_for_prime: interpreting handle=22 as a foreign scanout
[ 90578.146] kgem_check_surface_size(width=1000, height=678, bpp=32, tiling=0, pitch=4096, size=2777088)
[ 90578.146] kgem_surface_size: tile_width=8, tile_height=1 => aligned pitch=4000, height=678
[ 90578.146] kgem_check_surface_size: min_pitch=4000, min_size=2715648
[ 90578.146] kgem_check_surface_size: tile_width=1, tile_size=1
[ 90578.146] create_pixmap: allocating pixmap 0x0, depth=24, size=0
[ 90578.146] create_pixmap: serial=11003, usage=-1, 0x0
[ 90578.146] sna_pixmap_attach_to_bo: attaching GPU handle=22 to pixmap=11004
...
[ 90578.157] sna_dri3_pixmap_from_fd(fd=48, width=1000, height=678, stride=4096, depth=24, bpp=32)
[ 90578.157] kgem_create_for_prime(name=48)
[ 90578.157] kgem_create_for_prime: estimated size=2777088, actual=2777088
[ 90578.157] kgem_create_for_prime: new handle=22, tiling=0
[ 90578.157] __kgem_bo_init(handle=22, num_pages=678)
[ 90578.157] kgem_create_for_prime: imported handle=22 has caching 0
[ 90578.157] kgem_create_for_prime: interpreting handle=22 as a foreign scanout
[ 90578.157] sna_dri3_pixmap_from_fd: imported fd matches existing DRI3 pixmap=11004
[ 90578.157] _kgem_bo_destroy: handle=0, proxy? 0
[ 90578.157] __kgem_bo_destroy: handle=0, size=2777088
[ 90578.157] kgem_bo_free: handle=0, size=2777088
[ 90578.157] kgem_bo_free: releasing 0x0:0x0 vma for handle=0, count=0
[ 90578.157] sna_sync_flush(pixmap=11004)
[ 90578.157] sna_sync_flush: flushing prime GPU bo, handle=22
[ 90578.157] sna_pixmap_move_to_gpu(pixmap=11004, usage=-1), flags=4b
[ 90578.157] sna_pixmap_move_to_gpu: already all-damaged
[ 90578.157] sna_pixmap_mark_active: pixmap=11004, handle=

But that looks to be used for copying onto the scanout, so the compositor. Why it is not tiled is beyond, but that's the clients perogative. Anyway, my guess is that it is unrelated to the game.
Comment 6 Tobias Jakobi 2014-09-23 21:07:43 UTC
What do you mean by compositor? Desktop compositor? Because there is none active here.
Comment 7 Chris Wilson 2014-09-24 06:11:51 UTC
(In reply to comment #6)
> What do you mean by compositor? Desktop compositor? Because there is none
> active here.

Just a compositing WM. Then it may be related, or not as it seems to be not involved in the flipping.
Comment 8 Tobias Jakobi 2014-09-24 10:18:24 UTC
I use Xfce as DE, but I don't use the compositor in xfwm4. So no compositor is active here.

I still need to figure out how to properly setup xtrace in the chroot. Logs maybe later.
Comment 9 Chris Wilson 2014-09-24 11:25:32 UTC
I still find the reused Pixmap suspicious. It is only used in conjunction with Present, so I can't see a reason for an extra refcount inside the Xserver, and for mesa to reallocate the same buffer means it must have freed it and the Pixmap.

I would be interested to see if:

diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
index fa0210c..c8c2dfe 100644
--- a/src/mesa/drivers/dri/i965/brw_context.c
+++ b/src/mesa/drivers/dri/i965/brw_context.c
@@ -586,7 +586,7 @@ brw_process_driconf_options(struct brw_context *brw)
    case DRI_CONF_BO_REUSE_DISABLED:
       break;
    case DRI_CONF_BO_REUSE_ALL:
-      intel_bufmgr_gem_enable_reuse(brw->bufmgr);
+      //intel_bufmgr_gem_enable_reuse(brw->bufmgr);
       break;
    }
(there's also a drirc option bo_reuse=0)

prevents "sna_dri3_pixmap_from_fd: imported fd matches existing DRI3 pixmap"

I haven't seen that with a radeon hd7700/snb combo yet.
Comment 10 Tobias Jakobi 2014-09-24 17:14:12 UTC
I can't get xtrace to work. I'm not sure if it's due to it not understanding the DRI3 protocol (it doesn't have a 'template' file for DRI3 and Present), or because of different reasons. At least I can't get even glxgears to work properly (it hangs before showing a window). DRI2 works though.

Any other ideas for tools useable to trace this?

Concerning the BO reuse thing. I've disabled bo_reuse but the issue still happens, same backtrace as before.
Comment 11 Tobias Jakobi 2014-09-24 17:29:30 UTC
backtrace from glxgears when it hangs:
#0  0xf7752d10 in __kernel_vsyscall ()
#1  0xf737ef53 in poll () from /lib/libc.so.6
#2  0xf7081046 in _xcb_conn_wait (c=0x8442a78, cond=0xfff65cdc, vector=0x0, count=0x0)
    at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_conn.c:479
#3  0xf7082ec3 in wait_for_reply (c=0x8442a78, request=23, e=0x0) at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_in.c:491
#4  0xf7083018 in xcb_wait_for_reply (c=0x8442a78, request=23, e=0x0) at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_in.c:521
#5  0xf70d7db0 in xcb_dri3_open_reply (c=c@entry=0x8442a78, cookie=..., e=e@entry=0x0) at dri3.c:146
#6  0xf7621619 in dri3_open (provider=0, root=156, dpy=<optimized out>) at dri3_glx.c:1573
#7  dri3_create_screen (screen=0, priv=0x844c618) at dri3_glx.c:1812
#8  0xf75f563a in AllocAndFetchScreenConfigs (priv=0x844c618, dpy=0x8442008) at glxext.c:788
#9  __glXInitialize (dpy=dpy@entry=0x8442008) at glxext.c:902
#10 0xf75f1b7d in GetGLXPrivScreenConfig (dpy=dpy@entry=0x8442008, scrn=scrn@entry=0, ppriv=ppriv@entry=0xfff65e84, ppsc=ppsc@entry=0xfff65e88)
    at glxcmds.c:172
#11 0xf75f241e in GetGLXPrivScreenConfig (ppsc=0xfff65e88, ppriv=0xfff65e84, scrn=0, dpy=0x8442008) at glxcmds.c:168
#12 glXChooseVisual (dpy=0x8442008, screen=0, attribList=0xfff66080) at glxcmds.c:1249
#13 0x0804b2aa in ?? ()
#14 0x08049573 in ?? ()
#15 0xf72b6943 in __libc_start_main () from /lib/libc.so.6
#16 0x08049f01 in ?? ()
Comment 12 Tobias Jakobi 2014-09-24 17:30:31 UTC
Created attachment 106805 [details]
xtrace output
Comment 13 Chris Wilson 2014-09-24 20:12:54 UTC
(In reply to comment #10)
> I can't get xtrace to work. I'm not sure if it's due to it not understanding
> the DRI3 protocol (it doesn't have a 'template' file for DRI3 and Present),
> or because of different reasons. At least I can't get even glxgears to work
> properly (it hangs before showing a window). DRI2 works though.
> 
> Any other ideas for tools useable to trace this?

It's just lacking the protocol definition, the data is still in the xtrace, just need to learn hex...

Does it hang as fast as the xtrace suggests? Right at the very first request?

> Concerning the BO reuse thing. I've disabled bo_reuse but the issue still
> happens, same backtrace as before.

Right, that's reassuring, but I am curious as to whether that prevents it from trying to construct two Pixmaps from the same handle - i.e. the splat in the full debug log.

Most important question is does it really hang that early?
Comment 14 Tobias Jakobi 2014-09-25 00:04:07 UTC
(In reply to comment #13)
> It's just lacking the protocol definition, the data is still in the xtrace,
> just need to learn hex...
> 
> Does it hang as fast as the xtrace suggests? Right at the very first request?
Yes, it hangs immediately (on DRI3Open). Like I said, there is no window created.


> Right, that's reassuring, but I am curious as to whether that prevents it
> from trying to construct two Pixmaps from the same handle - i.e. the splat
> in the full debug log.
So you need another full debug log with bo_reuse disabled?


> Most important question is does it really hang that early?
You mean with bo_reuse?

The issue appears almost immediately after loading the savegame. I load the savegame, then I turn the camera and the image freezes. This is less than 2 seconds in-game.
Now with full debug it looks longer, I could take some steps, etc. -- let it be maybe 6 seconds.

I haven't noticed these difference between bo_reuse enabled and disabled.
Comment 15 Chris Wilson 2014-09-25 07:01:02 UTC
Oh drat, xtrace is not going to work without some serious magic to handle the fd passing. That's a nuisance.

Oh well, might as well see what the xserver debugging:

diff --git a/present/present.c b/present/present.c
index e838da7..4f1ea49 100644
--- a/present/present.c
+++ b/present/present.c
@@ -36,7 +36,7 @@ static uint64_t         present_event_id;
 static struct xorg_list present_exec_queue;
 static struct xorg_list present_flip_queue;
 
-#if 0
+#if 1
 #define DebugPresent(x) ErrorF x
 #else
 #define DebugPresent(x)

turns up.
Comment 16 Chris Wilson 2014-09-25 07:51:16 UTC
(In reply to comment #14)
> > Right, that's reassuring, but I am curious as to whether that prevents it
> > from trying to construct two Pixmaps from the same handle - i.e. the splat
> > in the full debug log.
> So you need another full debug log with bo_reuse disabled?

Right, might as well combine that with DebugPresent since it doesn't affect the reproduction of the hang.
 
> > Most important question is does it really hang that early?
> You mean with bo_reuse?

Nah, I was thinking about the xtrace, hoping that had some insight into the hang. But that's a lost cause for now :(
Comment 17 Chris Wilson 2014-10-06 12:15:43 UTC
One possible explanation:

xserver commit 9bc01dfc7070a40f5948588895b3a11dd1636d0e
Author: Axel Davy <axel.davy@ens.fr>
Date:   Sat Sep 27 23:17:13 2014 +0200

    Fix present_pixmap when using present_notify_msc
    
    Calling present_notify_msc could cancel a pending pixmap presentation.
    
    Signed-off-by: Axel Davy <axel.davy@ens.fr>
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Keith Packard <keithp@keithp.com>
Comment 18 Rodrigo Vivi 2014-10-15 20:45:57 UTC
*** Bug 84926 has been marked as a duplicate of this bug. ***
Comment 19 Tobias Jakobi 2014-10-16 23:51:47 UTC
I haven't found time yet to create the traces. Just some observations:

Updating mesa fixes the issue with my current ingame configuration (which is vsync off). I tracked this down to [url=http://cgit.freedesktop.org/mesa/mesa/commit/?id=f7a355556ef5fe23056299a77414f9ad8b5e5a1d]glx/dri3: Use four buffers until X driver supports async flips[/url].

I had vsync disabled since initially I wanted to benchmark the HSW. With PRIME and vsync disabled ingame the rendering however is very 'shacky', to the point where it's not playable.

I thus re-enabled vsync and the issue reappeared (strange I guess). Initial tests show that [url=https://www.freedesktop.org/patch/34966/]Correct present_notify_msc vblank target[/url] seems to fix the issue (with vsync on).
Comment 20 Tobias Jakobi 2014-10-16 23:53:04 UTC
Guess I shouldn't use bbcode here :(
Comment 21 Tobias Jakobi 2014-11-16 20:57:10 UTC
Finally managed to get this trace.

This is with xorg-server git tip (63bb5c5ef16edf652179770294dcca4fc07dc992), mesa git tip (aafbebe8ab64012fe625037c7773fa779b6f442f) and vanilla 3.17.3.

xserver was patched with patch from comment #15 to enable debugging of the Present extension. xf86-video-intel was build with full debug.

bo_reuse was disabled and vsync was switched on ingame.

Here's the compressed log:
http://www.math.uni-bielefeld.de/~tjakobi/archive/xorg.log.xz

Also note that the impression from comment #19 must have been a fluke ("Correct present_notify_msc vblank target" doesn't seem to have any effect on the freeze).
Comment 22 Chris Wilson 2014-11-17 09:52:50 UTC
Flips proceeds fine up until the final unflip is completed (event 1487). After that, present is copying invisible windows (until event 1565, the last event), but as far as I can see each copy is still followed by an idle notification.

The unflip is for when the window becomes clipped, and present switches back to the original pixmap and keeps on doing the clipped copy updates.

As far as the trace goes, there is not a missing idle event.
Comment 23 Tobias Jakobi 2014-11-17 19:34:52 UTC
Would it help to create another trace without "[PATCH] sna/dri3: Mesa relies upon implicit fences" applied to xf86-video-intel?
Comment 24 Chris Wilson 2014-11-17 21:24:52 UTC
(In reply to Tobias Jakobi from comment #23)
> Would it help to create another trace without "[PATCH] sna/dri3: Mesa relies
> upon implicit fences" applied to xf86-video-intel?

No, that shows up the in log and doesn't actually change the ordering of (Present) events. At the very least, the bug still occurs and I can see all the server side idle events that I would expect (and since it appears to continue to process PresentPixmaps, I am at a loss).
Comment 25 Michael Thayer 2014-11-27 09:33:13 UTC
Hitting this a lot with VirtualBox trunk builds running guests with 3D pass-through enabled on an Ubuntu 14.10 host system.  Usually when I resize the window/guest screen.
Comment 26 Tobias Jakobi 2014-12-26 19:30:34 UTC
Update:
xserver git master 826e7c2b36f192fbbe7ddff37eb559f4d6301146
xf86-video-intel git master 0d42b0ed25d4112e0b3e3218e5c42947bbeb9e27

DRI3 is enabled. In-game vsync is on. I haven't managed to hit the issue with this configuration. I'm cautious though (this needs more tests), so I'm leaving it open for now. My guess is that the recent PRESENT patches by Mario Kleiner might have fixed it.

What I noticed however is that tearing still happens (even though vsync is on). This is probably due to PRIME.
Comment 27 Tobias Jakobi 2015-01-06 21:32:06 UTC
OK, so I did some extensive tests today and while the issue is now harder to trigger, it is still there. A kinda reliable way to trigger it, is to toggle between in-game and main menu a few times.
Comment 28 Chris Wilson 2015-01-31 11:59:40 UTC
Not sna specific, reports of getting stuck in wait_for_specific_event with uxa as well.
Comment 29 Tobias Jakobi 2015-01-31 13:32:17 UTC
Hmm, that is strange. The last time I tested this with UXA I couldn't trigger the issue. I guess I need to recheck this more thoroughly.
Comment 30 Tobias Jakobi 2015-03-05 13:54:14 UTC
I retested today with updated stack:
mesa (29c23644cc8cfe9a1cdd75f79a96a1c9b49d26fa)
xf86-video-intel (db82617464e55432522e6199a88408ff0187f6ff)
xserver (3a06faf3fcdb7451125a46181f9152e8e59e9770)

Issue is still present (with SNA):
#0  0xf77afc90 in __kernel_vsyscall ()
#1  0xf73dac8c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xf76745cc in pthread_cond_wait () from /lib/libc.so.6
#3  0xf7192643 in ?? () from /usr/lib/libxcb.so.1
#4  0xf7194d63 in xcb_wait_for_special_event () from /usr/lib/libxcb.so.1
#5  0xf7524bd9 in dri3_find_back (c=c@entry=0x9fc0000, priv=priv@entry=0x9ffe000) at dri3_glx.c:1263
#6  0xf752542b in dri3_get_buffer (format=4099, buffer_type=buffer_type@entry=dri3_buffer_back, loaderPrivate=loaderPrivate@entry=0x9ffe000, 
    driDrawable=<optimized out>) at dri3_glx.c:1289
#7  0xf752624d in dri3_get_buffers (driDrawable=0xa0a05d0, format=4099, stamp=0x9b9cc60, loaderPrivate=0x9ffe000, buffer_mask=<optimized out>, 
    buffers=0xd460cbec) at dri3_glx.c:1466
#8  0xf442c483 in dri_image_drawable_get_buffers (statts_count=<optimized out>, statts=<optimized out>, images=<optimized out>, drawable=<optimized out>)
    at dri2.c:254
#9  dri2_allocate_textures (ctx=0xa0a0c00, drawable=0x9b9cc60, statts=0x9ba0b80, statts_count=2) at dri2.c:377
#10 0xf44290ec in dri_st_framebuffer_validate (stctx=0x9fee000, stfbi=0x9b9cc60, statts=0x9ba0b80, count=2, out=0xd460ccc0) at dri_drawable.c:83
#11 0xf434b9b8 in st_framebuffer_validate (stfb=stfb@entry=0x9ba0800, st=st@entry=0x9fee000) at state_tracker/st_manager.c:200
#12 0xf434d1f7 in st_manager_validate_framebuffers (st=st@entry=0x9fee000) at state_tracker/st_manager.c:863
#13 0xf430c866 in st_validate_state (st=st@entry=0x9fee000) at state_tracker/st_atom.c:180
#14 0xf4312f7c in st_BlitFramebuffer (ctx=0xa18a000, readFB=0xb99e380, drawFB=0x9ba0800, srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, 
    dstX1=1920, dstY1=0, mask=16384, filter=9728) at state_tracker/st_cb_blit.c:94
#15 0xf41adc17 in _mesa_BlitFramebuffer (srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728)
    at main/blit.c:509
#16 0xf58d8e13 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#17 0xf58d9617 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#18 0xf58cac3b in IDirect3DDevice9::Present(_RECT const*, _RECT const*, void*, RGNDATA const*) ()
   from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so
#19 0xecc8146d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/shaderapidx9.so
#20 0xf137bfde in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#21 0xf137c31d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#22 0xf135c644 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#23 0xf136484a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#24 0xf1364697 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so
#25 0xf5db349a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libvstdlib.so
#26 0xf5e9a9e0 in CThread::ThreadProc(void*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtier0.so
#27 0xf73d6f45 in start_thread () from /lib/libpthread.so.0
#28 0xf76678ee in clone () from /lib/libc.so.6

I haven't checked UXA yet.
Comment 31 Tobias Jakobi 2015-03-13 09:25:20 UTC
I tried to triage this further. The issue is a threading problem. I can 'fix' this reliably by disabling (in-game option) the multi-core rendering.

Here's the thread information for both configurations.

Multi-core rendering disabled:
(gdb) info threads
  Id   Target Id         Frame 
  20   Thread 0xf5344b40 (LWP 5361) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall ()
  19   Thread 0xf4b43b40 (LWP 5362) "IOJob0" 0xf77b7c90 in __kernel_vsyscall ()
  18   Thread 0xf2a9cb40 (LWP 5364) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall ()
  17   Thread 0xeb3f3b40 (LWP 5365) "UICEFThread" 0xf77b7c90 in __kernel_vsyscall ()
  16   Thread 0xeb2f2b40 (LWP 5366) "BrowserDBThread" 0xf77b7c90 in __kernel_vsyscall ()
  15   Thread 0xeaaf1b40 (LWP 5367) "SignalSender" 0xf77b7c90 in __kernel_vsyscall ()
  14   Thread 0xea8dab40 (LWP 5368) "NetworkChangeNo" 0xf77b7c90 in __kernel_vsyscall ()
  13   Thread 0xea0d9b40 (LWP 5369) "inotify_reader" 0xf77b7c90 in __kernel_vsyscall ()
  12   Thread 0xe98d8b40 (LWP 5370) "Cef_FileThread" 0xf77b7c90 in __kernel_vsyscall ()
  11   Thread 0xe90d7b40 (LWP 5371) "Cef_IOThread" 0xf77b7c90 in __kernel_vsyscall ()
  10   Thread 0xe88d6b40 (LWP 5372) "AppCacheDBThrea" 0xf77b7c90 in __kernel_vsyscall ()
  9    Thread 0xe7e0fb40 (LWP 5373) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall ()
  8    Thread 0xe7af4b40 (LWP 5374) "SDLAudioDev2" 0xf77b7c90 in __kernel_vsyscall ()
  7    Thread 0xda757b40 (LWP 5380) "CmpJob0" 0xf77b7c90 in __kernel_vsyscall ()
  6    Thread 0xda656b40 (LWP 5381) "CmpJob1" 0xf77b7c90 in __kernel_vsyscall ()
  5    Thread 0xda555b40 (LWP 5382) "CmpJob2" 0xf77b7c90 in __kernel_vsyscall ()
  4    Thread 0xd5453b40 (LWP 5383) "SaveJob0" 0xf77b7c90 in __kernel_vsyscall ()
  3    Thread 0xd5352b40 (LWP 5384) "QueuedPacketSen" 0xf77b7c90 in __kernel_vsyscall ()
  2    Thread 0xd007db40 (LWP 5385) "AchievementSave" 0xf77b7c90 in __kernel_vsyscall ()
* 1    Thread 0xf7137700 (LWP 5360) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall ()

Multi-core rendering enabled:
(gdb) info threads
  Id   Target Id         Frame 
  21   Thread 0xf532db40 (LWP 5449) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall ()
  20   Thread 0xf4b2cb40 (LWP 5450) "IOJob0" 0xf77a0c90 in __kernel_vsyscall ()
  19   Thread 0xf2a85b40 (LWP 5461) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall ()
  18   Thread 0xeb3dcb40 (LWP 5463) "UICEFThread" 0xf77a0c90 in __kernel_vsyscall ()
  17   Thread 0xeb2dbb40 (LWP 5464) "BrowserDBThread" 0xf77a0c90 in __kernel_vsyscall ()
  16   Thread 0xeaadab40 (LWP 5465) "SignalSender" 0xf77a0c90 in __kernel_vsyscall ()
  15   Thread 0xea8c3b40 (LWP 5469) "NetworkChangeNo" 0xf77a0c90 in __kernel_vsyscall ()
  14   Thread 0xea0c2b40 (LWP 5471) "inotify_reader" 0xf77a0c90 in __kernel_vsyscall ()
  13   Thread 0xe98c1b40 (LWP 5472) "Cef_FileThread" 0xf77a0c90 in __kernel_vsyscall ()
  12   Thread 0xe90c0b40 (LWP 5473) "Cef_IOThread" 0xf77a0c90 in __kernel_vsyscall ()
  11   Thread 0xe88bfb40 (LWP 5475) "AppCacheDBThrea" 0xf77a0c90 in __kernel_vsyscall ()
  10   Thread 0xe7df8b40 (LWP 5476) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall ()
  9    Thread 0xe7addb40 (LWP 5477) "SDLAudioDev2" 0xf77a0c90 in __kernel_vsyscall ()
  8    Thread 0xe4895b40 (LWP 5481) "CmpJob0" 0xf77a0c90 in __kernel_vsyscall ()
  7    Thread 0xe4794b40 (LWP 5482) "CmpJob1" 0xf77a0c90 in __kernel_vsyscall ()
  6    Thread 0xe4693b40 (LWP 5483) "CmpJob2" 0xf77a0c90 in __kernel_vsyscall ()
  5    Thread 0xdf591b40 (LWP 5484) "SaveJob0" 0xf77a0c90 in __kernel_vsyscall ()
  4    Thread 0xdf490b40 (LWP 5485) "QueuedPacketSen" 0xf77a0c90 in __kernel_vsyscall ()
  3    Thread 0xda27db40 (LWP 5486) "AchievementSave" 0xf77a0c90 in __kernel_vsyscall ()
  2    Thread 0xddf4cb40 (LWP 5501) "MatQueue0" 0xf77a0c90 in __kernel_vsyscall ()
* 1    Thread 0xf7120700 (LWP 5448) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall ()


With multi-core rendering the 'MatQueue0' thread appears (the one which then hangs later). My guess is that without multi-core the rendering is done from the main thread.

So something in our stack (xorg-server / mesa / DDX) is not thread-safe.
Comment 32 Axel Davy 2015-03-17 07:46:35 UTC
libxcb seems to suffer some issues, and that's potentially the problem here.

Here a dri3 client implementation with workaround to make threading work:
https://github.com/axeldavy/Xnine/blob/master/dri3.c

As you can see, presenting is done with a separate xcb connection than for everything else. That prevent a bug when a client is waiting for events in a thread, and with the same connection you present a pixmap.
Also there is some care to prevent two clients from dispatch present events at the same time.
Comment 33 Axel Davy 2015-03-17 07:47:17 UTC
I meant two threads, not two clients.
Comment 34 Tobias Jakobi 2015-03-20 01:46:37 UTC
Hello Axel,

thanks for the explanation. So if I understand this correctly, then this issues comes from the fact that each XCB connection object is protected by a (single) mutex (the iolock one, if I read the libxcb code correctly), which causes trouble in Mesa' GLX DRI3 implementation because the same XCB connection is then used from two different threads.
One thread A uses the connection to wait for one event, which doesn't come because thread B is blocked (XCB mutex is locked) and therefore never reaches the point where the event is signalled.

I had a brief look at the Xnine implementation and the GLX DRI3 one, but I'm not too familiar with it to port the "two connection" approach over.
Comment 35 Clément Guérin 2015-04-09 22:05:46 UTC
This particular issue makes firefox unusable with DRI3 and OpenGL OMTC (compositor in a separate thread) activated. It usually hangs after a minute or two.
Comment 36 Axel Davy 2015-04-17 16:36:34 UTC
After looking more carefully, it looks like the glx api
is designed to be used a single-threaded way : a glx context
can be used only in one thread.

It may be a problem with the fact it is using the xcb connection from Xlib. If firefox is using Xlib in another thread, it would lead to problems (I faced this bug when writing Present support for gallium nine).

I'll soon send a patch to make it use a different xcb connection, with another small fix, and it may solve the bug.
Comment 37 Axel Davy 2015-04-17 17:36:01 UTC
Created attachment 115165 [details]
Patch proposition

This patch will make glx use a different xcb connection.

I hope it solves the bug.
I haven't checked it works yet, but it compiles.
Comment 38 Clément Guérin 2015-04-17 17:43:51 UTC
(In reply to Axel Davy from comment #37)
> Created attachment 115165 [details]
> Patch proposition
> 
> This patch will make glx use a different xcb connection.
> 
> I hope it solves the bug.
> I haven't checked it works yet, but it compiles.

I'll test ASAP. Thanks for looking into it!
Comment 39 Tobias Jakobi 2015-04-17 21:33:23 UTC
Tested this this configuration:
- multi-core rendering enabled
- in-game vsync enabled/disabled

With both configuration and the patch applied, the issue doesn't surface anymore. Repeated toggling of the ingame menu, or scene changing / load sequences, stuff that previously quickly lead to the lockup, are now working.

@Axel: Feel free to add my Tested-by.
Comment 40 Clément Guérin 2015-04-18 13:13:14 UTC
With the patch applied on top of mesa 10.5.3 and DRI3 activated, libxcb segfaults when starting GDM here.
Comment 41 Tobias Jakobi 2015-04-18 15:28:46 UTC
Due to extreme stupidness the entire content of my last post can be disregarded. I'm now going back into my corner of shame :(
Comment 42 Axel Davy 2015-04-19 19:48:47 UTC
Created attachment 115200 [details] [review]
Second Patch Proposition

New patch proposition, it should work this time.
Comment 43 Clément Guérin 2015-04-19 22:45:14 UTC
(In reply to Axel Davy from comment #42)
> Created attachment 115200 [details] [review] [review]
> Second Patch Proposition
> 
> New patch proposition, it should work this time.

No more segfault and firefox works properly now. Nice job!

Tested with UXA and SNA.
Comment 44 Axel Davy 2015-04-23 17:16:33 UTC
The proposed patch probably won't work if an app is rendering to a same drawable with several threads, which glx allows.
It can be rewritten by creating two xcb connection per drawable (one to listen events, which would be protected by mutexes, so that only one thread can dispatch events, and one to send events).

However the total number of X connection is apparently 255 for an entire server. Considering the number of applications using gl at the same time, this can't be a practical solution.

Thus the bug must be solved at the root, that is libxcb and Xlib should be fixed, and we shouldn't workaround in mesa.
Comment 45 Thomas Daede 2015-04-24 21:00:26 UTC
This bug affects Servo on Linux as well:

https://github.com/servo/servo/issues/5789

Each painting thread on Servo draws to an XPixmap, which are then passed to the compositing thread which converts them to textures and paints them to the framebuffer.

The proposed patch fixes it for me. Oddly, I run e10s + OMTC firefox, and I have never hit this bug in it.

Are there any workarounds that can be used in the meantime?
Comment 46 Michel Dänzer 2015-04-28 01:12:02 UTC
BTW, this can be worked around by setting the environment variable LIBGL_DRI3_DISABLE=1 for the affected apps.
Comment 47 Tobias Jakobi 2015-04-28 08:53:37 UTC
This doesn't work for me since I have to have some custom settings in my xorg.conf to correct panel DPI. This makes it impossible to load the second DDX so that PRIME over DRI2 works.

So for me PRIME over DRI3 is the only way to use the dedicated GPU on this system.
Comment 48 Axel Davy 2015-05-16 09:12:37 UTC
Created attachment 115827 [details]
Test that triggers the hang

This is a test that triggers (sometimes - you may have to try several times) the event loss.
Comment 49 Michel Dänzer 2015-05-19 02:14:35 UTC
Created attachment 115887 [details] [review]
XCB special event wakeup fixes

This libxcb patch fixes the attached test case by Axel for me. Does it also help for the other use cases reported here (with Axel's previous Mesa patches reverted)?
Comment 50 Thomas Daede 2015-05-19 21:41:55 UTC
I'm able to confirm that your libxcb patch fixes Servo, with my system Mesa reverted back to Fedora's unpatched version.
Comment 51 Clément Guérin 2015-05-20 18:14:02 UTC
Fixes it for me as well.
Comment 52 Uli Schlachter 2015-06-12 07:55:27 UTC
Part of the patch was pushed:

commit f85661c3bca97faa72431df92a3867be39a74e23
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Mon Jun 1 11:04:18 2015 +0900

    Call _xcb_wake_up_next_reader from xcb_wait_for_special_event
    
    All functions calling _xcb_conn_wait() must make sure that waiting
    readers are woken up when we read a reply or event that they are waiting
    for. xcb_wait_for_special_event() did not do so. This adds the missing
    call to_xcb_in_wake_up_next_reader().
    
    Fixes deadlock when waiting for a special event and concurrently
    processing the display connection queue in another thread.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=84252
    Tested-by: Thomas Daede <bztdlinux@gmail.com>
    Tested-by: Clément Guérin <geecko.dev@free.fr>
    Reviewed-by: Uli Schlachter <psychon@znc.in>
    Signed-off-by: Michel Dänzer <michel@daenzer.net>
    Signed-off-by: Uli Schlachter <psychon@znc.in>

diff --git a/src/xcb_in.c b/src/xcb_in.c
index 623a0a8..322bed8 100644
--- a/src/xcb_in.c
+++ b/src/xcb_in.c
@@ -761,6 +761,7 @@ xcb_generic_event_t *xcb_wait_for_special_event(xcb_connection_t *c,
         if(!_xcb_conn_wait(c, &se->special_event_cond, 0, 0))
             break;
 
+    _xcb_in_wake_up_next_reader(c);
     pthread_mutex_unlock(&c->iolock);
     return event;
 }
Comment 53 Uli Schlachter 2015-06-12 13:12:10 UTC
Created attachment 116462 [details]
Another test that runs into a different, but quite similar hang
Comment 54 Uli Schlachter 2015-06-12 13:24:42 UTC
Created attachment 116463 [details] [review]
Make _xcb_in_wake_up_next_reader() wake up threads in xcb_wait_for_special_event()

Can someone check if this patch for libxcb also fixes the bug?

This replaces half of Michel's patch, so you would still need this part from that other patch (this part is already committed to libxcb's git master, so you could also just test master + this patch):

diff --git a/src/xcb_in.c b/src/xcb_in.c
index 623a0a8..2ce5a56 100644
--- a/src/xcb_in.c
+++ b/src/xcb_in.c
@@ -761,6 +761,7 @@ xcb_generic_event_t *xcb_wait_for_special_event(xcb_connection_t *c,
         if(!_xcb_conn_wait(c, &se->special_event_cond, 0, 0))
             break;
 
+    _xcb_in_wake_up_next_reader(c);
     pthread_mutex_unlock(&c->iolock);
     return event;
 }


Thanks for testing
Comment 55 Rex Dieter 2015-06-17 15:05:42 UTC
Fwiw, seeing what appear to be related deadlocks in plasma5 environment too, see downstream bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1193742

and the proposed fix/workaround from comment #49 seems to help.
Comment 56 Uli Schlachter 2015-06-25 18:40:49 UTC
A second part of this was pushed as commit 5b40681c887192307f3ae147d2158870aa79c05f. This should hopefully fix this.
Comment 57 Rex Dieter 2015-07-08 15:21:14 UTC
I've a downstream user
https://bugzilla.redhat.com/show_bug.cgi?id=1193742#c45
that recently experienced another deadlock (in plasmashell), using the latest commits referenced in this report.

#0  0x00000035130f666d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003d9c60a182 in _xcb_conn_wait (__timeout=-1, __nfds=1, __fds=0x7ffe1eecb730) at /usr/include/bits/poll2.h:46
#2  0x0000003d9c60a182 in _xcb_conn_wait (c=c@entry=0x1007ff0, cond=cond@entry=0x58bc8f8, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:459
#3  0x0000003d9c60bd49 in xcb_wait_for_special_event (c=c@entry=0x1007ff0, se=0x58bc8d0) at xcb_in.c:744
#4  0x0000003708e49e72 in dri3_find_back (c=c@entry=0x1007ff0, priv=priv@entry=0x87cd3a0) at dri3_glx.c:1263
#5  0x0000003708e4ac8f in dri3_get_buffers (driDrawable=0x87cd3a0, loaderPrivate=0x87cd3a0, buffer_type=dri3_buffer_back, format=4107) at dri3_glx.c:1289
#6  0x0000003708e4ac8f in dri3_get_buffers (driDrawable=driDrawable@entry=0x463e0a0, format=4107, stamp=stamp@entry=0x463e0d0, loaderPrivate=loaderPrivate@entry=0x87cd3a0, buffer_mask=<optimized out>, buffer_mask@entry=1, buffers=buffers@entry=0x7ffe1eecb980) at dri3_glx.c:1466
#7  0x00007f09f73ae14b in intel_update_renderbuffers (drawable=0x463e0a0, brw=0x2b06018) at brw_context.c:1527
#8  0x00007f09f73ae14b in intel_update_renderbuffers (context=context@entry=0x2af37b0, drawable=drawable@entry=0x463e0a0) at brw_context.c:1242
#9  0x00007f09f73ae4a1 in intel_prepare_render (brw=brw@entry=0x2b06018) at brw_context.c:1263
#10 0x00007f09f73a21c0 in brw_clear (ctx=0x2b06018, mask=50) at brw_clear.c:234
#11 0x0000003715f28bfa in QSGBatchRenderer::Renderer::renderBatches() (this=this@entry=0x8730b40)


Is this the same issue or should we open a separate report?
Comment 58 Michel Dänzer 2015-07-10 06:44:43 UTC
(In reply to Rex Dieter from comment #57)
> Is this the same issue or should we open a separate report?

Please file a separate report. Note that this could be caused by a bug in the X server / driver.
Comment 59 Nikola Šnele 2016-01-25 21:01:04 UTC
I am having this issue in Manjaro (Archlinux based) with Plasma 5.

I can reproduce the issue easily with 
xset dpms force off; sleep 2; kdialog --passivepopup foo 5

Changing to DRI2 fixes the issue.

Should I file a separate bug report?
Comment 60 Tim Writer 2016-01-25 21:16:33 UTC
Created attachment 121283 [details]
attachment-29921-0.html

?I'm out of the office due to illness. I hope to check e-mail from time to time but responses will be delayed.

Regards,
Tim
Comment 61 Michel Dänzer 2016-01-27 07:02:44 UTC
(In reply to Nikola Šnele from comment #59)
> Should I file a separate bug report?

Assuming your libxcb has the fix referenced in this report, yes please.
Comment 62 Martin Peres 2016-02-11 07:58:55 UTC
(In reply to Nikola Šnele from comment #59)
> I am having this issue in Manjaro (Archlinux based) with Plasma 5.
> 
> I can reproduce the issue easily with 
> xset dpms force off; sleep 2; kdialog --passivepopup foo 5
> 
> Changing to DRI2 fixes the issue.
> 
> Should I file a separate bug report?

I also have found this issue with notify-send, but it did not work as reliably.

However, does it work on a fresh XServer for you? For me, I have to wait multiple days before I can reproduce this issue. I made an xtrace ... but failed to store it thinking I would just test first but then, plasmashell did not want to start again and I had to kill X to get it to work again. I will have to wait a few more days to get this one.

(In reply to Michel Dänzer from comment #61)
> (In reply to Nikola Šnele from comment #59)
> > Should I file a separate bug report?
> 
> Assuming your libxcb has the fix referenced in this report, yes please.

Yes, I can reproduce this bug with a fixed libxcb. I will create a new bug report as soon as I get the xtrace.
Comment 63 Nikola Šnele 2016-02-11 08:15:07 UTC
@Martin Peres

Yes I have to wait day or two of uptime to get hit by the bug. On fresh boot I cannot reproduce it. But when freeze does happen I can easily reproduce it everytime until next restart/boot up.

When freeze does happen, I have to kill plasmashell and run it again to unfreeze it, and then I can reproduce the bug with 
xset dpms force off; sleep 2; kdialog --passivepopup foo 5
or sending ping notification from kdeconnect (any notification when display is in dpms off state will freeze plasmashell).

Can you tell me how to get xtrace (backtrace)? I am noob about this stuff.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.