Hello, this is a cleaned up version of (the now closed) bug #81548. Software stack: xorg-server git tip mesa git tip libdrm git tip libxcb git tip vanilla kernel 3.16.3 Primary GPU is a hsw/gen7.5 (i7-4700HQ), the secondary GPU is a R9 M265X. DRI3/PRIME is used. Issue: My test application is the x86 variant of HL2 Ep1. After loading a specific savegame I can reliably get the game to "hang". Attaching gdb yields the backtrace given below. Affected configurations: DRI3 with SNA backend enabled in the DDX. Switching to the UXA backend removes the issue, which (maybe?) points at synchronisation issues that are discussed in bug #81551. The patch proposed in the bugreport is applied, but doesn't cure the issue. Backtrace: #0 0xf771cd10 in __kernel_vsyscall () (gdb) bt #0 0xf771cd10 in __kernel_vsyscall () #1 0xf7348c8c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0xf75e55cc in pthread_cond_wait () from /lib/libc.so.6 #3 0xf70b751f in ?? () from /usr/lib/libxcb.so.1 #4 0xf70c13a5 in xcb_wait_for_special_event () from /usr/lib/libxcb.so.1 #5 0xf7498b49 in dri3_find_back (c=c@entry=0xa410000, priv=priv@entry=0x9fe23f0) at dri3_glx.c:1191 #6 0xf749932b in dri3_get_buffer (format=4099, buffer_type=buffer_type@entry=dri3_buffer_back, loaderPrivate=loaderPrivate@entry=0x9fe23f0, driDrawable=<optimized out>) at dri3_glx.c:1217 #7 0xf7499fd0 in dri3_get_buffers (driDrawable=0xa455110, format=4099, stamp=0x9e706e0, loaderPrivate=0x9fe23f0, buffer_mask=<optimized out>, buffers=0xb8ce6bec) at dri3_glx.c:1394 #8 0xf43a4cf3 in dri_image_drawable_get_buffers (statts_count=<optimized out>, statts=<optimized out>, images=<optimized out>, drawable=<optimized out>) at dri2.c:254 #9 dri2_allocate_textures (ctx=0xa455590, drawable=0x9e706e0, statts=0x9ff2b80, statts_count=2) at dri2.c:377 #10 0xf43a199c in dri_st_framebuffer_validate (stctx=0xa579000, stfbi=0x9e706e0, statts=0x9ff2b80, count=2, out=0xb8ce6cc0) at dri_drawable.c:83 #11 0xf42d0378 in st_framebuffer_validate (stfb=stfb@entry=0x9ff2800, st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_manager.c:200 #12 0xf42d1bb7 in st_manager_validate_framebuffers (st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_manager.c:862 #13 0xf4292bf6 in st_validate_state (st=st@entry=0xa579000) at ../../src/mesa/state_tracker/st_atom.c:180 #14 0xf429a6e5 in st_BlitFramebuffer (ctx=0x9ef0000, srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728) at ../../src/mesa/state_tracker/st_cb_blit.c:94 #15 0xf41358c3 in _mesa_BlitFramebuffer (srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728) at ../../src/mesa/main/blit.c:509 #16 0xf5831e13 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #17 0xf5832617 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #18 0xf5823c3b in IDirect3DDevice9::Present(_RECT const*, _RECT const*, void*, RGNDATA const*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #19 0xecd6646d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/shaderapidx9.so #20 0xf14edfde in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #21 0xf14ee31d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #22 0xf14ce644 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #23 0xf14d684a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #24 0xf14d6697 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #25 0xf5d0b49a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libvstdlib.so #26 0xf5df29e0 in CThread::ThreadProc(void*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtier0.so #27 0xf7344f45 in start_thread () from /lib/libpthread.so.0 #28 0xf75d88ee in clone () from /lib/libc.so.6 With best wishes, Tobias
Compile with the ddx --enable-debug=full and prepare to be amused.
Done, here's the log: http://www.math.uni-bielefeld.de/~tjakobi/archive/Xorg.0.log.xz
Every single present flip, unflip and vblank wait is completed and notified, all appear to be done so on the next vblank. Looks like present/mesa didn't coordinate their fence usage correctly.
Please try and capture the X command stream e.g. using xtrace. The other suspicion is that there was an error sent which the application didn't notice as it is waiting upon the special event.
In the debug log, there is one oddity: mesa reuses an active pixmap. [ 90578.146] sna_dri3_pixmap_from_fd(fd=48, width=1000, height=678, stride=4096, depth=24, bpp=32) [ 90578.146] kgem_create_for_prime(name=48) [ 90578.146] kgem_create_for_prime: estimated size=2777088, actual=2777088 [ 90578.146] kgem_create_for_prime: new handle=22, tiling=0 [ 90578.146] __kgem_bo_init(handle=22, num_pages=678) [ 90578.146] kgem_create_for_prime: imported handle=22 has caching 0 [ 90578.146] kgem_create_for_prime: interpreting handle=22 as a foreign scanout [ 90578.146] kgem_check_surface_size(width=1000, height=678, bpp=32, tiling=0, pitch=4096, size=2777088) [ 90578.146] kgem_surface_size: tile_width=8, tile_height=1 => aligned pitch=4000, height=678 [ 90578.146] kgem_check_surface_size: min_pitch=4000, min_size=2715648 [ 90578.146] kgem_check_surface_size: tile_width=1, tile_size=1 [ 90578.146] create_pixmap: allocating pixmap 0x0, depth=24, size=0 [ 90578.146] create_pixmap: serial=11003, usage=-1, 0x0 [ 90578.146] sna_pixmap_attach_to_bo: attaching GPU handle=22 to pixmap=11004 ... [ 90578.157] sna_dri3_pixmap_from_fd(fd=48, width=1000, height=678, stride=4096, depth=24, bpp=32) [ 90578.157] kgem_create_for_prime(name=48) [ 90578.157] kgem_create_for_prime: estimated size=2777088, actual=2777088 [ 90578.157] kgem_create_for_prime: new handle=22, tiling=0 [ 90578.157] __kgem_bo_init(handle=22, num_pages=678) [ 90578.157] kgem_create_for_prime: imported handle=22 has caching 0 [ 90578.157] kgem_create_for_prime: interpreting handle=22 as a foreign scanout [ 90578.157] sna_dri3_pixmap_from_fd: imported fd matches existing DRI3 pixmap=11004 [ 90578.157] _kgem_bo_destroy: handle=0, proxy? 0 [ 90578.157] __kgem_bo_destroy: handle=0, size=2777088 [ 90578.157] kgem_bo_free: handle=0, size=2777088 [ 90578.157] kgem_bo_free: releasing 0x0:0x0 vma for handle=0, count=0 [ 90578.157] sna_sync_flush(pixmap=11004) [ 90578.157] sna_sync_flush: flushing prime GPU bo, handle=22 [ 90578.157] sna_pixmap_move_to_gpu(pixmap=11004, usage=-1), flags=4b [ 90578.157] sna_pixmap_move_to_gpu: already all-damaged [ 90578.157] sna_pixmap_mark_active: pixmap=11004, handle= But that looks to be used for copying onto the scanout, so the compositor. Why it is not tiled is beyond, but that's the clients perogative. Anyway, my guess is that it is unrelated to the game.
What do you mean by compositor? Desktop compositor? Because there is none active here.
(In reply to comment #6) > What do you mean by compositor? Desktop compositor? Because there is none > active here. Just a compositing WM. Then it may be related, or not as it seems to be not involved in the flipping.
I use Xfce as DE, but I don't use the compositor in xfwm4. So no compositor is active here. I still need to figure out how to properly setup xtrace in the chroot. Logs maybe later.
I still find the reused Pixmap suspicious. It is only used in conjunction with Present, so I can't see a reason for an extra refcount inside the Xserver, and for mesa to reallocate the same buffer means it must have freed it and the Pixmap. I would be interested to see if: diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c index fa0210c..c8c2dfe 100644 --- a/src/mesa/drivers/dri/i965/brw_context.c +++ b/src/mesa/drivers/dri/i965/brw_context.c @@ -586,7 +586,7 @@ brw_process_driconf_options(struct brw_context *brw) case DRI_CONF_BO_REUSE_DISABLED: break; case DRI_CONF_BO_REUSE_ALL: - intel_bufmgr_gem_enable_reuse(brw->bufmgr); + //intel_bufmgr_gem_enable_reuse(brw->bufmgr); break; } (there's also a drirc option bo_reuse=0) prevents "sna_dri3_pixmap_from_fd: imported fd matches existing DRI3 pixmap" I haven't seen that with a radeon hd7700/snb combo yet.
I can't get xtrace to work. I'm not sure if it's due to it not understanding the DRI3 protocol (it doesn't have a 'template' file for DRI3 and Present), or because of different reasons. At least I can't get even glxgears to work properly (it hangs before showing a window). DRI2 works though. Any other ideas for tools useable to trace this? Concerning the BO reuse thing. I've disabled bo_reuse but the issue still happens, same backtrace as before.
backtrace from glxgears when it hangs: #0 0xf7752d10 in __kernel_vsyscall () #1 0xf737ef53 in poll () from /lib/libc.so.6 #2 0xf7081046 in _xcb_conn_wait (c=0x8442a78, cond=0xfff65cdc, vector=0x0, count=0x0) at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_conn.c:479 #3 0xf7082ec3 in wait_for_reply (c=0x8442a78, request=23, e=0x0) at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_in.c:491 #4 0xf7083018 in xcb_wait_for_reply (c=0x8442a78, request=23, e=0x0) at /var/tmp/portage/x11-libs/libxcb-9999/work/libxcb-9999/src/xcb_in.c:521 #5 0xf70d7db0 in xcb_dri3_open_reply (c=c@entry=0x8442a78, cookie=..., e=e@entry=0x0) at dri3.c:146 #6 0xf7621619 in dri3_open (provider=0, root=156, dpy=<optimized out>) at dri3_glx.c:1573 #7 dri3_create_screen (screen=0, priv=0x844c618) at dri3_glx.c:1812 #8 0xf75f563a in AllocAndFetchScreenConfigs (priv=0x844c618, dpy=0x8442008) at glxext.c:788 #9 __glXInitialize (dpy=dpy@entry=0x8442008) at glxext.c:902 #10 0xf75f1b7d in GetGLXPrivScreenConfig (dpy=dpy@entry=0x8442008, scrn=scrn@entry=0, ppriv=ppriv@entry=0xfff65e84, ppsc=ppsc@entry=0xfff65e88) at glxcmds.c:172 #11 0xf75f241e in GetGLXPrivScreenConfig (ppsc=0xfff65e88, ppriv=0xfff65e84, scrn=0, dpy=0x8442008) at glxcmds.c:168 #12 glXChooseVisual (dpy=0x8442008, screen=0, attribList=0xfff66080) at glxcmds.c:1249 #13 0x0804b2aa in ?? () #14 0x08049573 in ?? () #15 0xf72b6943 in __libc_start_main () from /lib/libc.so.6 #16 0x08049f01 in ?? ()
Created attachment 106805 [details] xtrace output
(In reply to comment #10) > I can't get xtrace to work. I'm not sure if it's due to it not understanding > the DRI3 protocol (it doesn't have a 'template' file for DRI3 and Present), > or because of different reasons. At least I can't get even glxgears to work > properly (it hangs before showing a window). DRI2 works though. > > Any other ideas for tools useable to trace this? It's just lacking the protocol definition, the data is still in the xtrace, just need to learn hex... Does it hang as fast as the xtrace suggests? Right at the very first request? > Concerning the BO reuse thing. I've disabled bo_reuse but the issue still > happens, same backtrace as before. Right, that's reassuring, but I am curious as to whether that prevents it from trying to construct two Pixmaps from the same handle - i.e. the splat in the full debug log. Most important question is does it really hang that early?
(In reply to comment #13) > It's just lacking the protocol definition, the data is still in the xtrace, > just need to learn hex... > > Does it hang as fast as the xtrace suggests? Right at the very first request? Yes, it hangs immediately (on DRI3Open). Like I said, there is no window created. > Right, that's reassuring, but I am curious as to whether that prevents it > from trying to construct two Pixmaps from the same handle - i.e. the splat > in the full debug log. So you need another full debug log with bo_reuse disabled? > Most important question is does it really hang that early? You mean with bo_reuse? The issue appears almost immediately after loading the savegame. I load the savegame, then I turn the camera and the image freezes. This is less than 2 seconds in-game. Now with full debug it looks longer, I could take some steps, etc. -- let it be maybe 6 seconds. I haven't noticed these difference between bo_reuse enabled and disabled.
Oh drat, xtrace is not going to work without some serious magic to handle the fd passing. That's a nuisance. Oh well, might as well see what the xserver debugging: diff --git a/present/present.c b/present/present.c index e838da7..4f1ea49 100644 --- a/present/present.c +++ b/present/present.c @@ -36,7 +36,7 @@ static uint64_t present_event_id; static struct xorg_list present_exec_queue; static struct xorg_list present_flip_queue; -#if 0 +#if 1 #define DebugPresent(x) ErrorF x #else #define DebugPresent(x) turns up.
(In reply to comment #14) > > Right, that's reassuring, but I am curious as to whether that prevents it > > from trying to construct two Pixmaps from the same handle - i.e. the splat > > in the full debug log. > So you need another full debug log with bo_reuse disabled? Right, might as well combine that with DebugPresent since it doesn't affect the reproduction of the hang. > > Most important question is does it really hang that early? > You mean with bo_reuse? Nah, I was thinking about the xtrace, hoping that had some insight into the hang. But that's a lost cause for now :(
One possible explanation: xserver commit 9bc01dfc7070a40f5948588895b3a11dd1636d0e Author: Axel Davy <axel.davy@ens.fr> Date: Sat Sep 27 23:17:13 2014 +0200 Fix present_pixmap when using present_notify_msc Calling present_notify_msc could cancel a pending pixmap presentation. Signed-off-by: Axel Davy <axel.davy@ens.fr> Reviewed-by: Keith Packard <keithp@keithp.com> Signed-off-by: Keith Packard <keithp@keithp.com>
*** Bug 84926 has been marked as a duplicate of this bug. ***
I haven't found time yet to create the traces. Just some observations: Updating mesa fixes the issue with my current ingame configuration (which is vsync off). I tracked this down to [url=http://cgit.freedesktop.org/mesa/mesa/commit/?id=f7a355556ef5fe23056299a77414f9ad8b5e5a1d]glx/dri3: Use four buffers until X driver supports async flips[/url]. I had vsync disabled since initially I wanted to benchmark the HSW. With PRIME and vsync disabled ingame the rendering however is very 'shacky', to the point where it's not playable. I thus re-enabled vsync and the issue reappeared (strange I guess). Initial tests show that [url=https://www.freedesktop.org/patch/34966/]Correct present_notify_msc vblank target[/url] seems to fix the issue (with vsync on).
Guess I shouldn't use bbcode here :(
Finally managed to get this trace. This is with xorg-server git tip (63bb5c5ef16edf652179770294dcca4fc07dc992), mesa git tip (aafbebe8ab64012fe625037c7773fa779b6f442f) and vanilla 3.17.3. xserver was patched with patch from comment #15 to enable debugging of the Present extension. xf86-video-intel was build with full debug. bo_reuse was disabled and vsync was switched on ingame. Here's the compressed log: http://www.math.uni-bielefeld.de/~tjakobi/archive/xorg.log.xz Also note that the impression from comment #19 must have been a fluke ("Correct present_notify_msc vblank target" doesn't seem to have any effect on the freeze).
Flips proceeds fine up until the final unflip is completed (event 1487). After that, present is copying invisible windows (until event 1565, the last event), but as far as I can see each copy is still followed by an idle notification. The unflip is for when the window becomes clipped, and present switches back to the original pixmap and keeps on doing the clipped copy updates. As far as the trace goes, there is not a missing idle event.
Would it help to create another trace without "[PATCH] sna/dri3: Mesa relies upon implicit fences" applied to xf86-video-intel?
(In reply to Tobias Jakobi from comment #23) > Would it help to create another trace without "[PATCH] sna/dri3: Mesa relies > upon implicit fences" applied to xf86-video-intel? No, that shows up the in log and doesn't actually change the ordering of (Present) events. At the very least, the bug still occurs and I can see all the server side idle events that I would expect (and since it appears to continue to process PresentPixmaps, I am at a loss).
Hitting this a lot with VirtualBox trunk builds running guests with 3D pass-through enabled on an Ubuntu 14.10 host system. Usually when I resize the window/guest screen.
Update: xserver git master 826e7c2b36f192fbbe7ddff37eb559f4d6301146 xf86-video-intel git master 0d42b0ed25d4112e0b3e3218e5c42947bbeb9e27 DRI3 is enabled. In-game vsync is on. I haven't managed to hit the issue with this configuration. I'm cautious though (this needs more tests), so I'm leaving it open for now. My guess is that the recent PRESENT patches by Mario Kleiner might have fixed it. What I noticed however is that tearing still happens (even though vsync is on). This is probably due to PRIME.
OK, so I did some extensive tests today and while the issue is now harder to trigger, it is still there. A kinda reliable way to trigger it, is to toggle between in-game and main menu a few times.
Not sna specific, reports of getting stuck in wait_for_specific_event with uxa as well.
Hmm, that is strange. The last time I tested this with UXA I couldn't trigger the issue. I guess I need to recheck this more thoroughly.
I retested today with updated stack: mesa (29c23644cc8cfe9a1cdd75f79a96a1c9b49d26fa) xf86-video-intel (db82617464e55432522e6199a88408ff0187f6ff) xserver (3a06faf3fcdb7451125a46181f9152e8e59e9770) Issue is still present (with SNA): #0 0xf77afc90 in __kernel_vsyscall () #1 0xf73dac8c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0xf76745cc in pthread_cond_wait () from /lib/libc.so.6 #3 0xf7192643 in ?? () from /usr/lib/libxcb.so.1 #4 0xf7194d63 in xcb_wait_for_special_event () from /usr/lib/libxcb.so.1 #5 0xf7524bd9 in dri3_find_back (c=c@entry=0x9fc0000, priv=priv@entry=0x9ffe000) at dri3_glx.c:1263 #6 0xf752542b in dri3_get_buffer (format=4099, buffer_type=buffer_type@entry=dri3_buffer_back, loaderPrivate=loaderPrivate@entry=0x9ffe000, driDrawable=<optimized out>) at dri3_glx.c:1289 #7 0xf752624d in dri3_get_buffers (driDrawable=0xa0a05d0, format=4099, stamp=0x9b9cc60, loaderPrivate=0x9ffe000, buffer_mask=<optimized out>, buffers=0xd460cbec) at dri3_glx.c:1466 #8 0xf442c483 in dri_image_drawable_get_buffers (statts_count=<optimized out>, statts=<optimized out>, images=<optimized out>, drawable=<optimized out>) at dri2.c:254 #9 dri2_allocate_textures (ctx=0xa0a0c00, drawable=0x9b9cc60, statts=0x9ba0b80, statts_count=2) at dri2.c:377 #10 0xf44290ec in dri_st_framebuffer_validate (stctx=0x9fee000, stfbi=0x9b9cc60, statts=0x9ba0b80, count=2, out=0xd460ccc0) at dri_drawable.c:83 #11 0xf434b9b8 in st_framebuffer_validate (stfb=stfb@entry=0x9ba0800, st=st@entry=0x9fee000) at state_tracker/st_manager.c:200 #12 0xf434d1f7 in st_manager_validate_framebuffers (st=st@entry=0x9fee000) at state_tracker/st_manager.c:863 #13 0xf430c866 in st_validate_state (st=st@entry=0x9fee000) at state_tracker/st_atom.c:180 #14 0xf4312f7c in st_BlitFramebuffer (ctx=0xa18a000, readFB=0xb99e380, drawFB=0x9ba0800, srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728) at state_tracker/st_cb_blit.c:94 #15 0xf41adc17 in _mesa_BlitFramebuffer (srcX0=0, srcY0=0, srcX1=1920, srcY1=1080, dstX0=0, dstY0=1080, dstX1=1920, dstY1=0, mask=16384, filter=9728) at main/blit.c:509 #16 0xf58d8e13 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #17 0xf58d9617 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #18 0xf58cac3b in IDirect3DDevice9::Present(_RECT const*, _RECT const*, void*, RGNDATA const*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtogl.so #19 0xecc8146d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/shaderapidx9.so #20 0xf137bfde in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #21 0xf137c31d in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #22 0xf135c644 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #23 0xf136484a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #24 0xf1364697 in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/materialsystem.so #25 0xf5db349a in ?? () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libvstdlib.so #26 0xf5e9a9e0 in CThread::ThreadProc(void*) () from /mnt/extern/superNova/steam-native/SteamApps/common/Half-Life 2/bin/libtier0.so #27 0xf73d6f45 in start_thread () from /lib/libpthread.so.0 #28 0xf76678ee in clone () from /lib/libc.so.6 I haven't checked UXA yet.
I tried to triage this further. The issue is a threading problem. I can 'fix' this reliably by disabling (in-game option) the multi-core rendering. Here's the thread information for both configurations. Multi-core rendering disabled: (gdb) info threads Id Target Id Frame 20 Thread 0xf5344b40 (LWP 5361) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall () 19 Thread 0xf4b43b40 (LWP 5362) "IOJob0" 0xf77b7c90 in __kernel_vsyscall () 18 Thread 0xf2a9cb40 (LWP 5364) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall () 17 Thread 0xeb3f3b40 (LWP 5365) "UICEFThread" 0xf77b7c90 in __kernel_vsyscall () 16 Thread 0xeb2f2b40 (LWP 5366) "BrowserDBThread" 0xf77b7c90 in __kernel_vsyscall () 15 Thread 0xeaaf1b40 (LWP 5367) "SignalSender" 0xf77b7c90 in __kernel_vsyscall () 14 Thread 0xea8dab40 (LWP 5368) "NetworkChangeNo" 0xf77b7c90 in __kernel_vsyscall () 13 Thread 0xea0d9b40 (LWP 5369) "inotify_reader" 0xf77b7c90 in __kernel_vsyscall () 12 Thread 0xe98d8b40 (LWP 5370) "Cef_FileThread" 0xf77b7c90 in __kernel_vsyscall () 11 Thread 0xe90d7b40 (LWP 5371) "Cef_IOThread" 0xf77b7c90 in __kernel_vsyscall () 10 Thread 0xe88d6b40 (LWP 5372) "AppCacheDBThrea" 0xf77b7c90 in __kernel_vsyscall () 9 Thread 0xe7e0fb40 (LWP 5373) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall () 8 Thread 0xe7af4b40 (LWP 5374) "SDLAudioDev2" 0xf77b7c90 in __kernel_vsyscall () 7 Thread 0xda757b40 (LWP 5380) "CmpJob0" 0xf77b7c90 in __kernel_vsyscall () 6 Thread 0xda656b40 (LWP 5381) "CmpJob1" 0xf77b7c90 in __kernel_vsyscall () 5 Thread 0xda555b40 (LWP 5382) "CmpJob2" 0xf77b7c90 in __kernel_vsyscall () 4 Thread 0xd5453b40 (LWP 5383) "SaveJob0" 0xf77b7c90 in __kernel_vsyscall () 3 Thread 0xd5352b40 (LWP 5384) "QueuedPacketSen" 0xf77b7c90 in __kernel_vsyscall () 2 Thread 0xd007db40 (LWP 5385) "AchievementSave" 0xf77b7c90 in __kernel_vsyscall () * 1 Thread 0xf7137700 (LWP 5360) "hl2_linux" 0xf77b7c90 in __kernel_vsyscall () Multi-core rendering enabled: (gdb) info threads Id Target Id Frame 21 Thread 0xf532db40 (LWP 5449) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall () 20 Thread 0xf4b2cb40 (LWP 5450) "IOJob0" 0xf77a0c90 in __kernel_vsyscall () 19 Thread 0xf2a85b40 (LWP 5461) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall () 18 Thread 0xeb3dcb40 (LWP 5463) "UICEFThread" 0xf77a0c90 in __kernel_vsyscall () 17 Thread 0xeb2dbb40 (LWP 5464) "BrowserDBThread" 0xf77a0c90 in __kernel_vsyscall () 16 Thread 0xeaadab40 (LWP 5465) "SignalSender" 0xf77a0c90 in __kernel_vsyscall () 15 Thread 0xea8c3b40 (LWP 5469) "NetworkChangeNo" 0xf77a0c90 in __kernel_vsyscall () 14 Thread 0xea0c2b40 (LWP 5471) "inotify_reader" 0xf77a0c90 in __kernel_vsyscall () 13 Thread 0xe98c1b40 (LWP 5472) "Cef_FileThread" 0xf77a0c90 in __kernel_vsyscall () 12 Thread 0xe90c0b40 (LWP 5473) "Cef_IOThread" 0xf77a0c90 in __kernel_vsyscall () 11 Thread 0xe88bfb40 (LWP 5475) "AppCacheDBThrea" 0xf77a0c90 in __kernel_vsyscall () 10 Thread 0xe7df8b40 (LWP 5476) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall () 9 Thread 0xe7addb40 (LWP 5477) "SDLAudioDev2" 0xf77a0c90 in __kernel_vsyscall () 8 Thread 0xe4895b40 (LWP 5481) "CmpJob0" 0xf77a0c90 in __kernel_vsyscall () 7 Thread 0xe4794b40 (LWP 5482) "CmpJob1" 0xf77a0c90 in __kernel_vsyscall () 6 Thread 0xe4693b40 (LWP 5483) "CmpJob2" 0xf77a0c90 in __kernel_vsyscall () 5 Thread 0xdf591b40 (LWP 5484) "SaveJob0" 0xf77a0c90 in __kernel_vsyscall () 4 Thread 0xdf490b40 (LWP 5485) "QueuedPacketSen" 0xf77a0c90 in __kernel_vsyscall () 3 Thread 0xda27db40 (LWP 5486) "AchievementSave" 0xf77a0c90 in __kernel_vsyscall () 2 Thread 0xddf4cb40 (LWP 5501) "MatQueue0" 0xf77a0c90 in __kernel_vsyscall () * 1 Thread 0xf7120700 (LWP 5448) "hl2_linux" 0xf77a0c90 in __kernel_vsyscall () With multi-core rendering the 'MatQueue0' thread appears (the one which then hangs later). My guess is that without multi-core the rendering is done from the main thread. So something in our stack (xorg-server / mesa / DDX) is not thread-safe.
libxcb seems to suffer some issues, and that's potentially the problem here. Here a dri3 client implementation with workaround to make threading work: https://github.com/axeldavy/Xnine/blob/master/dri3.c As you can see, presenting is done with a separate xcb connection than for everything else. That prevent a bug when a client is waiting for events in a thread, and with the same connection you present a pixmap. Also there is some care to prevent two clients from dispatch present events at the same time.
I meant two threads, not two clients.
Hello Axel, thanks for the explanation. So if I understand this correctly, then this issues comes from the fact that each XCB connection object is protected by a (single) mutex (the iolock one, if I read the libxcb code correctly), which causes trouble in Mesa' GLX DRI3 implementation because the same XCB connection is then used from two different threads. One thread A uses the connection to wait for one event, which doesn't come because thread B is blocked (XCB mutex is locked) and therefore never reaches the point where the event is signalled. I had a brief look at the Xnine implementation and the GLX DRI3 one, but I'm not too familiar with it to port the "two connection" approach over.
This particular issue makes firefox unusable with DRI3 and OpenGL OMTC (compositor in a separate thread) activated. It usually hangs after a minute or two.
After looking more carefully, it looks like the glx api is designed to be used a single-threaded way : a glx context can be used only in one thread. It may be a problem with the fact it is using the xcb connection from Xlib. If firefox is using Xlib in another thread, it would lead to problems (I faced this bug when writing Present support for gallium nine). I'll soon send a patch to make it use a different xcb connection, with another small fix, and it may solve the bug.
Created attachment 115165 [details] Patch proposition This patch will make glx use a different xcb connection. I hope it solves the bug. I haven't checked it works yet, but it compiles.
(In reply to Axel Davy from comment #37) > Created attachment 115165 [details] > Patch proposition > > This patch will make glx use a different xcb connection. > > I hope it solves the bug. > I haven't checked it works yet, but it compiles. I'll test ASAP. Thanks for looking into it!
Tested this this configuration: - multi-core rendering enabled - in-game vsync enabled/disabled With both configuration and the patch applied, the issue doesn't surface anymore. Repeated toggling of the ingame menu, or scene changing / load sequences, stuff that previously quickly lead to the lockup, are now working. @Axel: Feel free to add my Tested-by.
With the patch applied on top of mesa 10.5.3 and DRI3 activated, libxcb segfaults when starting GDM here.
Due to extreme stupidness the entire content of my last post can be disregarded. I'm now going back into my corner of shame :(
Created attachment 115200 [details] [review] Second Patch Proposition New patch proposition, it should work this time.
(In reply to Axel Davy from comment #42) > Created attachment 115200 [details] [review] [review] > Second Patch Proposition > > New patch proposition, it should work this time. No more segfault and firefox works properly now. Nice job! Tested with UXA and SNA.
The proposed patch probably won't work if an app is rendering to a same drawable with several threads, which glx allows. It can be rewritten by creating two xcb connection per drawable (one to listen events, which would be protected by mutexes, so that only one thread can dispatch events, and one to send events). However the total number of X connection is apparently 255 for an entire server. Considering the number of applications using gl at the same time, this can't be a practical solution. Thus the bug must be solved at the root, that is libxcb and Xlib should be fixed, and we shouldn't workaround in mesa.
This bug affects Servo on Linux as well: https://github.com/servo/servo/issues/5789 Each painting thread on Servo draws to an XPixmap, which are then passed to the compositing thread which converts them to textures and paints them to the framebuffer. The proposed patch fixes it for me. Oddly, I run e10s + OMTC firefox, and I have never hit this bug in it. Are there any workarounds that can be used in the meantime?
BTW, this can be worked around by setting the environment variable LIBGL_DRI3_DISABLE=1 for the affected apps.
This doesn't work for me since I have to have some custom settings in my xorg.conf to correct panel DPI. This makes it impossible to load the second DDX so that PRIME over DRI2 works. So for me PRIME over DRI3 is the only way to use the dedicated GPU on this system.
Created attachment 115827 [details] Test that triggers the hang This is a test that triggers (sometimes - you may have to try several times) the event loss.
Created attachment 115887 [details] [review] XCB special event wakeup fixes This libxcb patch fixes the attached test case by Axel for me. Does it also help for the other use cases reported here (with Axel's previous Mesa patches reverted)?
I'm able to confirm that your libxcb patch fixes Servo, with my system Mesa reverted back to Fedora's unpatched version.
Fixes it for me as well.
Part of the patch was pushed: commit f85661c3bca97faa72431df92a3867be39a74e23 Author: Michel Dänzer <michel.daenzer@amd.com> Date: Mon Jun 1 11:04:18 2015 +0900 Call _xcb_wake_up_next_reader from xcb_wait_for_special_event All functions calling _xcb_conn_wait() must make sure that waiting readers are woken up when we read a reply or event that they are waiting for. xcb_wait_for_special_event() did not do so. This adds the missing call to_xcb_in_wake_up_next_reader(). Fixes deadlock when waiting for a special event and concurrently processing the display connection queue in another thread. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=84252 Tested-by: Thomas Daede <bztdlinux@gmail.com> Tested-by: Clément Guérin <geecko.dev@free.fr> Reviewed-by: Uli Schlachter <psychon@znc.in> Signed-off-by: Michel Dänzer <michel@daenzer.net> Signed-off-by: Uli Schlachter <psychon@znc.in> diff --git a/src/xcb_in.c b/src/xcb_in.c index 623a0a8..322bed8 100644 --- a/src/xcb_in.c +++ b/src/xcb_in.c @@ -761,6 +761,7 @@ xcb_generic_event_t *xcb_wait_for_special_event(xcb_connection_t *c, if(!_xcb_conn_wait(c, &se->special_event_cond, 0, 0)) break; + _xcb_in_wake_up_next_reader(c); pthread_mutex_unlock(&c->iolock); return event; }
Created attachment 116462 [details] Another test that runs into a different, but quite similar hang
Created attachment 116463 [details] [review] Make _xcb_in_wake_up_next_reader() wake up threads in xcb_wait_for_special_event() Can someone check if this patch for libxcb also fixes the bug? This replaces half of Michel's patch, so you would still need this part from that other patch (this part is already committed to libxcb's git master, so you could also just test master + this patch): diff --git a/src/xcb_in.c b/src/xcb_in.c index 623a0a8..2ce5a56 100644 --- a/src/xcb_in.c +++ b/src/xcb_in.c @@ -761,6 +761,7 @@ xcb_generic_event_t *xcb_wait_for_special_event(xcb_connection_t *c, if(!_xcb_conn_wait(c, &se->special_event_cond, 0, 0)) break; + _xcb_in_wake_up_next_reader(c); pthread_mutex_unlock(&c->iolock); return event; } Thanks for testing
Fwiw, seeing what appear to be related deadlocks in plasma5 environment too, see downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=1193742 and the proposed fix/workaround from comment #49 seems to help.
A second part of this was pushed as commit 5b40681c887192307f3ae147d2158870aa79c05f. This should hopefully fix this.
I've a downstream user https://bugzilla.redhat.com/show_bug.cgi?id=1193742#c45 that recently experienced another deadlock (in plasmashell), using the latest commits referenced in this report. #0 0x00000035130f666d in poll () at ../sysdeps/unix/syscall-template.S:81 #1 0x0000003d9c60a182 in _xcb_conn_wait (__timeout=-1, __nfds=1, __fds=0x7ffe1eecb730) at /usr/include/bits/poll2.h:46 #2 0x0000003d9c60a182 in _xcb_conn_wait (c=c@entry=0x1007ff0, cond=cond@entry=0x58bc8f8, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:459 #3 0x0000003d9c60bd49 in xcb_wait_for_special_event (c=c@entry=0x1007ff0, se=0x58bc8d0) at xcb_in.c:744 #4 0x0000003708e49e72 in dri3_find_back (c=c@entry=0x1007ff0, priv=priv@entry=0x87cd3a0) at dri3_glx.c:1263 #5 0x0000003708e4ac8f in dri3_get_buffers (driDrawable=0x87cd3a0, loaderPrivate=0x87cd3a0, buffer_type=dri3_buffer_back, format=4107) at dri3_glx.c:1289 #6 0x0000003708e4ac8f in dri3_get_buffers (driDrawable=driDrawable@entry=0x463e0a0, format=4107, stamp=stamp@entry=0x463e0d0, loaderPrivate=loaderPrivate@entry=0x87cd3a0, buffer_mask=<optimized out>, buffer_mask@entry=1, buffers=buffers@entry=0x7ffe1eecb980) at dri3_glx.c:1466 #7 0x00007f09f73ae14b in intel_update_renderbuffers (drawable=0x463e0a0, brw=0x2b06018) at brw_context.c:1527 #8 0x00007f09f73ae14b in intel_update_renderbuffers (context=context@entry=0x2af37b0, drawable=drawable@entry=0x463e0a0) at brw_context.c:1242 #9 0x00007f09f73ae4a1 in intel_prepare_render (brw=brw@entry=0x2b06018) at brw_context.c:1263 #10 0x00007f09f73a21c0 in brw_clear (ctx=0x2b06018, mask=50) at brw_clear.c:234 #11 0x0000003715f28bfa in QSGBatchRenderer::Renderer::renderBatches() (this=this@entry=0x8730b40) Is this the same issue or should we open a separate report?
(In reply to Rex Dieter from comment #57) > Is this the same issue or should we open a separate report? Please file a separate report. Note that this could be caused by a bug in the X server / driver.
I am having this issue in Manjaro (Archlinux based) with Plasma 5. I can reproduce the issue easily with xset dpms force off; sleep 2; kdialog --passivepopup foo 5 Changing to DRI2 fixes the issue. Should I file a separate bug report?
Created attachment 121283 [details] attachment-29921-0.html ?I'm out of the office due to illness. I hope to check e-mail from time to time but responses will be delayed. Regards, Tim
(In reply to Nikola Šnele from comment #59) > Should I file a separate bug report? Assuming your libxcb has the fix referenced in this report, yes please.
(In reply to Nikola Šnele from comment #59) > I am having this issue in Manjaro (Archlinux based) with Plasma 5. > > I can reproduce the issue easily with > xset dpms force off; sleep 2; kdialog --passivepopup foo 5 > > Changing to DRI2 fixes the issue. > > Should I file a separate bug report? I also have found this issue with notify-send, but it did not work as reliably. However, does it work on a fresh XServer for you? For me, I have to wait multiple days before I can reproduce this issue. I made an xtrace ... but failed to store it thinking I would just test first but then, plasmashell did not want to start again and I had to kill X to get it to work again. I will have to wait a few more days to get this one. (In reply to Michel Dänzer from comment #61) > (In reply to Nikola Šnele from comment #59) > > Should I file a separate bug report? > > Assuming your libxcb has the fix referenced in this report, yes please. Yes, I can reproduce this bug with a fixed libxcb. I will create a new bug report as soon as I get the xtrace.
@Martin Peres Yes I have to wait day or two of uptime to get hit by the bug. On fresh boot I cannot reproduce it. But when freeze does happen I can easily reproduce it everytime until next restart/boot up. When freeze does happen, I have to kill plasmashell and run it again to unfreeze it, and then I can reproduce the bug with xset dpms force off; sleep 2; kdialog --passivepopup foo 5 or sending ping notification from kdeconnect (any notification when display is in dpms off state will freeze plasmashell). Can you tell me how to get xtrace (backtrace)? I am noob about this stuff.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.