Bug 93414

Summary: hang in xcb_wait_for_reply / vl_dri2_get_flush_reply
Product: XCB Reporter: Vasilis LIaskovitis <vliaskovitis>
Component: LibraryAssignee: xcb mailing list dummy <xcb>
Status: RESOLVED FIXED QA Contact: xcb mailing list dummy <xcb>
Severity: normal    
Priority: medium    
Version: 1.11   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=93412
Whiteboard:
i915 platform: i915 features:

Description Vasilis LIaskovitis 2015-12-16 21:33:38 UTC
Application that renders using libva/vdpau/opengl hangs after a few minutes of operation (always reproducible within 1-15 minutes usually).
 
I see a hang in xcb_wait_for_reply() for request #31161:
 
Possibly relevant part of xtrace:
000:>:4d8a:32: Reply to SwapBuffers: swap_hi=0 swap_lo=31161
000:>:4d8b:32: Reply to GetInputFocus: revert-to=Parent(0x02) focus=0x02e000e0
000:<:24d8c: 20: DRI2-Request(155,7): GetBuffersWithFormat drawable=0x04400006 attachments={attachment=BackLeft(0x00000001) format=0x00000020};
000:>:24d8c: Event DRI2-BufferSwapComplete(102) drawable=0x00000002 ust_hi=71303174 ust_lo=0 msc_hi=1858207713 msc_lo=0 sbc_hi=109393 sbc_lo=31161
 
And:
000:<:31161:  4: Request(43): GetInputFocus  
000:>:31161: Event DRI2-InvalidateBuffers(103) drawable=0x04400006
 
Backtrace of thread waiting in xcb_wait_for_reply():
 
Thread 8 (Thread 0x7fca1f778700 (LWP 3580)):
#0  0x00007fca55a3e12d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fca5316b262 in poll (__timeout=-1, __nfds=1, __fds=0x7fca1f776a20) at
    /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x2885500, cond=cond@entry=0x7fca1f776b40,
        vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:459
#3  0x00007fca5316cadf in wait_for_reply (c=c@entry=0x2885500, request=31161,
        e=e@entry=0x0) at xcb_in.c:491
#4  0x00007fca5316cbf2 in xcb_wait_for_reply (c=0x2885500, request=31161, e=0x0)
at xcb_in.c:521
#5  0x00007fca419b8004 in vl_dri2_get_flush_reply (scrn=0x2653600) at
../../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:110
#6  0x00007fca419b8101 in vl_dri2_destroy_drawable (scrn=0x2653600) at
../../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:153
#7  0x00007fca419b8185 in vl_dri2_set_drawable (scrn=0x2653600,
        drawable=71303307) at
../../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:176
#8  0x00007fca419b82c8 in vl_dri2_screen_texture_from_drawable
(vscreen=0x2653600, drawable=0x440008b) at
../../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:193
#9  0x00007fca419b61ab in vlVdpPresentationQueueDisplay
(presentation_queue=<optimized out>, surface=5, clip_width=1280,
 clip_height=720, earliest_presentation_time=0)
        at ../../../../../../src/gallium/state_trackers/vdpau/presentation.c:234
#10 0x00007fca423ca6a3 in flip_surface_unlocked
        (driver_data=driver_data@entry=0x2650c90,
         obj_output=obj_output@entry=0x291a2b0) at vdpau_video_x11.c:583
#11 0x00007fca423cb2e1 in queue_surface_unlocked (obj_surface=<optimized out>,
        obj_output=0x291a2b0, driver_data=0x2650c90) at vdpau_video_x11.c:610
#12 put_surface_unlocked (flags=32, target_rect=0x7fca1f776f90,
        source_rect=0x7fca1f776f80, obj_output=0x291a2b0, obj_surface=0x2919bf0,
        driver_data=0x2650c90) at vdpau_video_x11.c:689
#13 put_surface (driver_data=driver_data@entry=0x2650c90,
        surface=surface@entry=50331655, drawable=drawable@entry=71303307,
        drawable_width=<optimized out>, drawable_height=720,
            source_rect=source_rect@entry=0x7fca1f776f80,
            target_rect=target_rect@entry=0x7fca1f776f90, flags=flags@entry=32)
at vdpau_video_x11.c:755
#14 0x00007fca423cb59c in vdpau_PutSurface (ctx=<optimized out>,
        surface=50331655, draw=0x440008b, srcx=<optimized out>, srcy=<optimized
        out>, srcw=<optimized out>, srch=720, destx=0, desty=0,
            destw=1280, desth=720, cliprects=0x0, number_cliprects=0, flags=32)
at vdpau_video_x11.c:808
#15 0x000000000040eed2 in on_client_draw (glsink=<optimized out>, buf=<optimized
        out>, data=0x626d40 <video>) at video.c:619


Full backtrace at: http://pastebin.com/rQCJm64C

Using:

libX11 1.6.2
libxcb 1.11 or 1.10
mesa 11.0 (11.0~git1508070730.42d283~gd~t from oibaf/graphics-drivers ppa)
xserver-xorg-video-radeon 1:7.5.99+git1508070731.3791fc~gd~t (same ppa)


I also made same bug  under dri/gallium/radeon_si https://bugs.freedesktop.org/show_bug.cgi?id=93412 , apologies for this, I am not sure where to file this problem. I will close one of the two bugs whenI am pointed to the most appropriate component of the two.
Comment 1 Michel Dänzer 2015-12-17 06:49:26 UTC
First of all, does the app call XInitThreads before any other libX11 functions are called (on any thread)?

If so, any chance you can try libxcb 1.11.1 or a current Git snapshot?
Comment 2 Vasilis LIaskovitis 2015-12-17 14:31:09 UTC
Yes, XInitThreads() is called at startuo (main function) before any other libX11 functions are called from any threads of the app.

Problem still exists with libxcb 1.11.1 (I am buiding and installing it manually though. Does newer Xlib also need to be built, or is xcb sufficient?)
Comment 3 Uli Schlachter 2015-12-19 10:45:52 UTC
(In reply to Vasilis LIaskovitis from comment #0)
[...]
> I see a hang in xcb_wait_for_reply() for request #31161:
>  
> Possibly relevant part of xtrace:
> 000:>:4d8a:32: Reply to SwapBuffers: swap_hi=0 swap_lo=31161
> 000:>:4d8b:32: Reply to GetInputFocus: revert-to=Parent(0x02)
> focus=0x02e000e0
> 000:<:24d8c: 20: DRI2-Request(155,7): GetBuffersWithFormat
> drawable=0x04400006 attachments={attachment=BackLeft(0x00000001)
> format=0x00000020};
> 000:>:24d8c: Event DRI2-BufferSwapComplete(102) drawable=0x00000002
> ust_hi=71303174 ust_lo=0 msc_hi=1858207713 msc_lo=0 sbc_hi=109393
> sbc_lo=31161

Why is this part relevant? It's 0x31161-0x24d8c = 50133 requests before the hang, so apparently things still work for 50k requests. That should be a long time.

> And:
> 000:<:31161:  4: Request(43): GetInputFocus  
> 000:>:31161: Event DRI2-InvalidateBuffers(103) drawable=0x04400006

So if I understand this correctly, you are saying that requests #31161 is a GetInputFocus request and xtrace never saw a reply to it, right?
Isn't the bug then in the server for not replying to GetInputFocus?

> Full backtrace at: http://pastebin.com/rQCJm64C

Thread 16 is in radeon_something, waits on a condition
Thread 15 waits for the vaapi lock(?) (pthread_mutex_lock() from within gst_vaapisink_x11_handle_events())
Thread 14 waits on a condition inside gst_data_queue_push() (hm?!? Is some queue full, or why does it wait?)
Thread 13 waits on a condition inside gst_task_func (dunno what this is)
Thread 12 waits on a condition inside gst_vaapidecode_handle_frame() (and has a deep backtrace with lots of gst stuff)
Thread 11 waits in gst_queue_chain_buffer_or_list() (deep backtrace)
Thread 10 waits on a condition inside gst_queue_loop()
Thread 9 waits on a condition inside gst_base_sink_wait_preroll() (deep backtrace)
Thread 8 poll()s for a reply to request 31161 (called from vl_dri2_get_flush_reply()) (deep backtrace)
Thread 7 waits on a condition inside ??() in libavcodec.so
So do thread 6, 5, 4, 3.
Thread 2 poll()s something pulseaudio related (pa_mainloop_poll())
Thread 1 *also* poll()s the xcb connection (wait_for_reply() on request 212481, called from _XReply(), called from DRI2GetBuffersWithFormat, called eventually from st_Clear() (something from mesa), called from main())

Ohhh... thread 1 and thread 8 wait on different XCB connections. I have no clue about the internals of mesa etc, but could it be that something did a GrabServer and that's why the GetInputFocus (and DRI2GetBuffersWithFormat) isn't replied to? Since each thread involving libxcb has its own connection, I really don't think that the deadlock is inside of libxcb.

@Vasilis: How do you know that this is a deadlock? Does your X11 server still work correctly otherwise? I'd expect both of these pending requests to get replies.

Also, could you do something like this ('thread 1' would switch to, well, thread 1 and frame 2 to the frame with `_xcb_conn_wait()` so that we get a reference to the XCB connection, the various prints print the state of the XCB connection):
thread 1
frame 2
print c
print c.in->current_reply
print c.in->events
print c.in->pending_replies
Comment 4 Vasilis LIaskovitis 2015-12-21 10:23:58 UTC
(In reply to Uli Schlachter from comment #3)
> (In reply to Vasilis LIaskovitis from comment #0)
> [...]
> > I see a hang in xcb_wait_for_reply() for request #31161:
> >  
> > Possibly relevant part of xtrace:
> > 000:>:4d8a:32: Reply to SwapBuffers: swap_hi=0 swap_lo=31161
> > 000:>:4d8b:32: Reply to GetInputFocus: revert-to=Parent(0x02)
> > focus=0x02e000e0
> > 000:<:24d8c: 20: DRI2-Request(155,7): GetBuffersWithFormat
> > drawable=0x04400006 attachments={attachment=BackLeft(0x00000001)
> > format=0x00000020};
> > 000:>:24d8c: Event DRI2-BufferSwapComplete(102) drawable=0x00000002
> > ust_hi=71303174 ust_lo=0 msc_hi=1858207713 msc_lo=0 sbc_hi=109393
> > sbc_lo=31161
> 
> Why is this part relevant? It's 0x31161-0x24d8c = 50133 requests before the
> hang, so apparently things still work for 50k requests. That should be a
> long time.
> 

You are right, this looks irrelevant.. not very familiar xith X protocol, I was thinking sbc_lo could refer to a request ID.

> > And:
> > 000:<:31161:  4: Request(43): GetInputFocus  
> > 000:>:31161: Event DRI2-InvalidateBuffers(103) drawable=0x04400006
> 
> So if I understand this correctly, you are saying that requests #31161 is a
> GetInputFocus request and xtrace never saw a reply to it, right?
> Isn't the bug then in the server for not replying to GetInputFocus?
> 
> > Full backtrace at: http://pastebin.com/rQCJm64C
> 
> Thread 16 is in radeon_something, waits on a condition
> Thread 15 waits for the vaapi lock(?) (pthread_mutex_lock() from within
> gst_vaapisink_x11_handle_events())
> Thread 14 waits on a condition inside gst_data_queue_push() (hm?!? Is some
> queue full, or why does it wait?)
> Thread 13 waits on a condition inside gst_task_func (dunno what this is)
> Thread 12 waits on a condition inside gst_vaapidecode_handle_frame() (and
> has a deep backtrace with lots of gst stuff)
> Thread 11 waits in gst_queue_chain_buffer_or_list() (deep backtrace)
> Thread 10 waits on a condition inside gst_queue_loop()
> Thread 9 waits on a condition inside gst_base_sink_wait_preroll() (deep
> backtrace)
> Thread 8 poll()s for a reply to request 31161 (called from
> vl_dri2_get_flush_reply()) (deep backtrace)
> Thread 7 waits on a condition inside ??() in libavcodec.so
> So do thread 6, 5, 4, 3.
> Thread 2 poll()s something pulseaudio related (pa_mainloop_poll())
> Thread 1 *also* poll()s the xcb connection (wait_for_reply() on request
> 212481, called from _XReply(), called from DRI2GetBuffersWithFormat, called
> eventually from st_Clear() (something from mesa), called from main())
> 
> Ohhh... thread 1 and thread 8 wait on different XCB connections. I have no
> clue about the internals of mesa etc, but could it be that something did a
> GrabServer and that's why the GetInputFocus (and DRI2GetBuffersWithFormat)
> isn't replied to? Since each thread involving libxcb has its own connection,
> I really don't think that the deadlock is inside of libxcb.
> 
> @Vasilis: How do you know that this is a deadlock? Does your X11 server
> still work correctly otherwise? I'd expect both of these pending requests to
> get replies.
> 
> Also, could you do something like this ('thread 1' would switch to, well,
> thread 1 and frame 2 to the frame with `_xcb_conn_wait()` so that we get a
> reference to the XCB connection, the various prints print the state of the
> XCB connection):
> thread 1
> frame 2
> print c
> print c.in->current_reply
> print c.in->events
> print c.in->pending_replies

pending_replies is 0

(gdb) print c.in->current_reply
$4 = (struct reply_list *) 0x0
(gdb) print c.in->events
$5 = (struct event_list *) 0x7f0b1c008730
(gdb) print c.in->pending_replies
$6 = (struct pending_reply *) 0x0

Looking at all threads in bt, this may be a completely different issue with gstreamer queues.. we can maybe close the issue and reopen if I find something that really proves it is an xcb/xserver issue (or if you have another idea)
Comment 5 Uli Schlachter 2015-12-21 10:34:22 UTC
Just an explanation for that debug info:
There are no replies that XCB already received, but did not return to any caller. Some events were received, but not yet returned to the application (nothing called xcb_poll/wait_for_event()). The empty "pending_replies" list means that no answer to checked requests is pending (hm? Is this the thread where libX11 owns the socket?).

Besides that, I don't know. I'd blame the X server for not answering and the non-X11 threads don't tell me anything.
Comment 6 Sergey Alirzaev 2018-09-23 20:53:47 UTC
Seems like i'm hit by the same bug. In my case, it hangs in OpenGL games (glxgears seems to be unaffected) randomly when i switch workspaces. With StarRuler2 i only have one thread hanging while communicating with X, and here's the backtrace:

		#0  0x00007ffff5d05db3 in poll () from /lib64/libc.so.6
		#1  0x00007ffff55769e2 in _xcb_conn_wait () from /usr/lib64/libxcb.so.1
		#2  0x00007ffff55789b7 in wait_for_reply () from /usr/lib64/libxcb.so.1
		#3  0x00007ffff5578b71 in xcb_wait_for_reply64 () from /usr/lib64/libxcb.so.1
		#4  0x00007ffff7a3e288 in _XReply () from /usr/lib64/libX11.so.6
		#5  0x00007ffff7f891c7 in DRI2GetBuffersWithFormat (dpy=0x555555a6c410, drawable=163577862, width=width@entry=0x555555e20a28, height=height@entry=0x555555e20a2c, attachments=0x7fffffffd620, count=1, outCount=0x7fffffffd69c)
		    at ../mesa-18.2.0/src/glx/dri2.c:485
		#6  0x00007ffff7f89596 in dri2GetBuffersWithFormat (driDrawable=<optimized out>, width=0x555555e20a28, height=0x555555e20a2c, attachments=<optimized out>, count=<optimized out>, out_count=0x7fffffffd69c, loaderPrivate=0x555555e20c80)
		    at ../mesa-18.2.0/src/glx/dri2_glx.c:900
		#7  0x00007fffeb3b57ef in intel_query_dri2_buffers (brw=brw@entry=0x555555c6c1b0, drawable=drawable@entry=0x555555e20a00, buffers=buffers@entry=0x7fffffffd6a0, buffer_count=buffer_count@entry=0x7fffffffd69c)
		    at ../mesa-18.2.0/src/mesa/drivers/dri/i965/brw_context.c:1531
		#8  0x00007fffeb3b592f in intel_update_dri2_buffers (brw=brw@entry=0x555555c6c1b0, drawable=drawable@entry=0x555555e20a00) at ../mesa-18.2.0/src/mesa/drivers/dri/i965/brw_context.c:1370
		#9  0x00007fffeb3b5b3d in intel_update_renderbuffers (context=context@entry=0x555555c660a0, drawable=drawable@entry=0x555555e20a00) at ../mesa-18.2.0/src/mesa/drivers/dri/i965/brw_context.c:1426
		#10 0x00007fffeb3b5baf in intel_prepare_render (brw=brw@entry=0x555555c6c1b0) at ../mesa-18.2.0/src/mesa/drivers/dri/i965/brw_context.c:1445
		#11 0x00007fffeb3b079e in brw_clear (ctx=0x555555c6c1b0, mask=272) at ../mesa-18.2.0/src/mesa/drivers/dri/i965/brw_clear.c:254
		#12 0x00007fffeb5b1fe7 in clear (no_error=false, mask=16640, ctx=0x555555c6c1b0) at ../mesa-18.2.0/src/mesa/main/clear.c:221
		#13 _mesa_Clear (mask=<optimized out>) at ../mesa-18.2.0/src/mesa/main/clear.c:242
		#14 0x00005555557d77c9 in render::GLDriver::setRenderTarget(render::Texture*, bool) ()
		#15 0x000055555582ecdb in renderFrame(scripts::Manager*, scripts::Manager*) ()
		#16 0x0000555555813eed in tickMenu() ()
		#17 0x00005555555a842c in main ()

Can i do anything about it?
Comment 7 Sergey Alirzaev 2018-09-24 05:48:34 UTC
Seems like it is fixed if i build xf86-video-intel with --with-default-dri=3
Comment 8 Daniel Stone 2019-02-16 12:40:01 UTC
This sounds like the issue is that DRI2 is broken in the X server itself, which I believe was fixed last year.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.