Bug 74752

Summary: Weston-desktop-shell hangs.
Product: Wayland Reporter: nerdopolis1
Component: weston Assignee: Wayland bug list <wayland-bugs>
Status: RESOLVED NOTOURBUG QA Contact:
Severity: normal    
Priority: medium CC: pochu27
Version: unspecified   
Hardware: Other   
OS: All   
Attachments: backtrace of a hung weston-desktop-shell
bt full of a hung weston-desktop-shell

Description nerdopolis1 2014-02-09 16:33:08 UTC
Hi. It seems weston-desktop-shell is hanging. At one point it hung even with no other clients running.

This does not seem to happen on the FBDEV backend, but it does happen on DRM and X11 backends. Most of my testing of Weston is on FBDEV, so I can't be sure when exactly it was introduced.

But I do get this output:
[3156457.465] wl_pointer@24.motion(393998481, 490.000000, 46.000000)
[3156468.469] wl_pointer@24.motion(393998492, 490.000000, 42.000000)
[3156468.555] wl_pointer@24.motion(393998492, 491.000000, 39.000000)
[3156479.226] wl_pointer@24.motion(393998503, 491.000000, 36.000000)
[3156489.740] wl_pointer@24.motion(393998513, 493.000000, 33.000000)
[3156500.615] wl_pointer@24.leave(334, wl_surface@22)
[3156500.661] wl_keyboard@25.modifiers(335, 0, 0, 0, 0)
[3156500.697] wl_pointer@24.enter(335, wl_surface@21, 493.000000, 30.000000)
[3156500.722]  -> wl_surface@9.attach(wl_buffer@27, 0, 0)
[3156500.738]  -> wl_surface@9.damage(0, 0, 24, 24)
[3156500.755]  -> wl_surface@9.commit()
[3156500.762]  -> wl_pointer@24.set_cursor(335, wl_surface@9, 5, 0)
[3156500.779] wl_pointer@24.motion(393998524, 493.000000, 30.000000)
[3156500.808] wl_pointer@24.motion(393998524, 493.000000, 27.000000)
[3156500.830]  -> wl_surface@21.frame(new id wl_callback@34)
*** Error in `/opt/libexec/weston-desktop-shell': malloc(): smallbin double linked list corrupted: 0x091b2a70 ***
Comment 1 U. Artie Eoff 2014-02-10 01:56:03 UTC
How is this reproduced? Which version (ideally, commit IDs)?
Comment 2 nerdopolis1 2014-02-10 02:03:03 UTC
This is weston commit dfaf65ba1636e49b850adff34f31de00b5f06bba
Comment 3 U. Artie Eoff 2014-02-10 02:22:06 UTC
(In reply to comment #2)
> This is weston commit dfaf65ba1636e49b850adff34f31de00b5f06bba

Hmmm... seems to work fine for me on the following stack:

wayland (master) heads/master-0-ga18e344
drm (master) heads/master-0-g128e74c
mesa (master) heads/master-0-g5125165
libva (master) heads/master-0-gb4a4f9b
intel-driver (master) heads/master-0-g54cb60f
weston (master) heads/master-0-gdfaf65b

and tested on Intel Ivybridge, Fedora 20, X11 backend and desktop-shell using cairo-glesv2 (v1.12.14).
Comment 4 nerdopolis1 2014-02-10 03:59:45 UTC
I'll try rebuilding Mesa...
Comment 5 Emilio Pozuelo Monfort 2014-02-10 07:58:28 UTC
Can you run weston-desktop-shell under valgrind so we see where the error comes from?
Comment 6 nerdopolis1 2014-02-10 12:10:31 UTC
I had it set to follow all pids when I called Weston.

I am seeing the following, but under valgrind it doesn't hang, so I'm not sure whether this is the same issue:

==345== Invalid read of size 4
==345==    at 0x4221D48: _cairo_gl_surface_resolve_multisampling (cairo-gl-surface.c:1311)
==345==    by 0x8: ???
==345==  Address 0x4c46610 is 8 bytes before a block of size 80 alloc'd
==345==    at 0x402B965: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==345==    by 0x50BAF5A: ralloc_size (ralloc.c:113)
==345==    by 0x50BAFD4: rzalloc_size (ralloc.c:134)
==345==    by 0x4F1E8C0: _mesa_hash_table_rehash (hash_table.c:223)
==345==    by 0x4F1EA89: _mesa_hash_table_insert (hash_table.c:261)
==345==    by 0x4F1E139: _mesa_HashInsert (hash.c:226)
==345==    by 0x4F72048: _mesa_GenTextures (texobj.c:1029)
==345==    by 0x4221503: _create_scratch_internal (cairo-gl-surface.c:454)
==345==    by 0x422161F: _cairo_gl_surface_create_and_clear_scratch (cairo-gl-surface.c:509)
==345==    by 0x42218BA: cairo_gl_surface_create (cairo-gl-surface.c:612)
==345==    by 0x4218C7C: _cairo_gl_composite_glyphs_with_clip (cairo-gl-glyphs.c:365)
==345==    by 0x4218F1E: _cairo_gl_composite_glyphs (cairo-gl-glyphs.c:482)
==345== 
==345== Conditional jump or move depends on uninitialised value(s)
==345==    at 0x421522E: _cairo_gl_context_setup_operand (cairo-gl-composite.c:225)
==345==    by 0x4215756: _cairo_gl_set_operands_and_operator (cairo-gl-composite.c:724)
==345==    by 0x42159CF: _cairo_gl_composite_begin (cairo-gl-composite.c:760)
==345==    by 0x421EA7E: composite_boxes (cairo-gl-spans-compositor.c:409)
==345==    by 0x41C6CE5: clip_and_composite_boxes.part.10 (cairo-spans-compositor.c:683)
==345==    by 0x41C71F5: clip_and_composite_boxes (cairo-spans-compositor.c:901)
==345==    by 0x1: ???
==345==
Comment 7 nerdopolis1 2014-02-10 23:18:01 UTC
I forgot to mention, I am compiling Weston with --with-cairo=gl
It's been a while since I last looked at my weston build script.
Comment 8 Pekka Paalanen 2014-02-11 06:35:39 UTC
(In reply to comment #7)
> I forgot to mention, I am compiling Weston with --with-cairo=gl

Ookay, that's pretty rare, I think. Do Wayland clients use egl_dri2 or egl_gallium on your system?

The fbdev backend does not initialize server-side EGL, so it does not advertise wl_drm. The DRM and x11 backends do. This means that on the client side, weston-desktop-shell on the fbdev compositor will either complain and fall back to wl_shm, or it will use egl_gallium with a software renderer, depending on your Mesa build. On the other compositors, weston-desktop-shell will attempt to use wl_drm if advertised, which means you hit either egl_dri2 or egl_gallium, again depending on your Mesa build.

What is your gfx card flavour, intel, nouveau or radeon?

I believe the untested and mostly unmaintained combination is egl_gallium with nouveau and radeon drivers (with wl_drm), so problems are not unexpected. Do you hit this case?

Unfortunately egl_gallium is at the moment the only way to use software-rendered GL, and Mesa prefers egl_gallium over egl_dri2. You could override that with the EGL_DRIVER environment variable.
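To make the backend-dependent behaviour above easier to follow, here is a simplified sketch of that client-side choice written as a C decision function. It is not Mesa or Weston code: the enum, the struct, and choose_client_path() are hypothetical names, and the model collapses "depending on your Mesa build" into a single mesa_built_egl_gallium flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the explanation above; not Mesa or Weston source. */
enum client_path {
    PATH_WL_SHM,          /* EGL unusable: client falls back to wl_shm */
    PATH_EGL_GALLIUM_SW,  /* egl_gallium with a software renderer */
    PATH_EGL_GALLIUM_HW,  /* egl_gallium on the GPU, buffers via wl_drm */
    PATH_EGL_DRI2         /* egl_dri2 on the GPU, buffers via wl_drm */
};

struct client_env {
    bool compositor_advertises_wl_drm; /* true for DRM/x11 backends, false for fbdev */
    bool mesa_built_egl_gallium;       /* assumption: stands in for the whole Mesa build config */
    const char *egl_driver;            /* value of EGL_DRIVER, NULL when unset */
};

enum client_path choose_client_path(const struct client_env *env)
{
    assert(env != NULL);

    /* Mesa prefers egl_gallium when it is built; EGL_DRIVER=egl_dri2 overrides that. */
    bool use_gallium = env->mesa_built_egl_gallium &&
                       !(env->egl_driver != NULL &&
                         strcmp(env->egl_driver, "egl_dri2") == 0);

    if (!env->compositor_advertises_wl_drm) {
        /* fbdev: no server-side EGL, so no wl_drm. Only software
         * egl_gallium can still render with GL; otherwise use wl_shm. */
        return use_gallium ? PATH_EGL_GALLIUM_SW : PATH_WL_SHM;
    }

    /* DRM and x11 backends advertise wl_drm, so a GPU driver is used. */
    return use_gallium ? PATH_EGL_GALLIUM_HW : PATH_EGL_DRI2;
}
```

Under this model, a DRM or x11 compositor with egl_gallium built lands on the hardware egl_gallium path, while fbdev only ever takes a path that avoids wl_drm, which would be consistent with the hang not showing up on fbdev. Setting EGL_DRIVER=egl_dri2 switches the hardware case over to egl_dri2.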
Comment 9 nerdopolis1 2014-02-11 13:10:40 UTC
Hi.

This is an Intel card.

I build my mesa with
./autogen.sh --prefix=$INSTALLDIR --enable-driglx-direct --enable-dri --with-dri-drivers=r200,radeon,nouveau,i915,i965,swrast --enable-osmesa --enable-xa --enable-glx-tls --enable-shared-dricore --enable-gles2  --with-gallium-drivers=nouveau,svga,r300,r600,swrast,radeonsi,ilo  --with-egl-platforms=x11,wayland,drm --enable-gbm --enable-shared-glapi --enable-gallium-egl --with-llvm-prefix=/usr/lib/llvm-3.4/ --disable-dri3 --with-llvm-shared-libs --libdir=$INSTALLDIR/lib/$(dpkg-architecture -qDEB_HOST_MULTIARCH) 

I enable so many options because this is a Live CD distribution.

I'll try EGL_DRIVER=egl_gallium, but I'll also see how this works on vbox...
Comment 10 nerdopolis1 2014-02-11 23:27:15 UTC
Created attachment 93890
backtrace of a hung weston-desktop-shell

Weston built with --with-cairo=gl
Comment 11 nerdopolis1 2014-02-13 04:11:32 UTC
I now get this without --with-cairo=gl too, with it set to --with-cairo=image instead. I still get the hang unless I start Weston with --use-pixman, if these details help.
Comment 12 Pekka Paalanen 2014-02-13 07:33:28 UTC
(In reply to comment #10)
> Created attachment 93890
> backtrace of a hung weston-desktop-shell
> 
> Weston built with --with-cairo=gl

That... does not look like any specific problem; it looks like memory corruption. It detects an error inside malloc() and then hangs while trying to report it: the reporting involves an init-once step that dlopens a library, and that deadlocks on a mutex. Something I would describe as "wtf".

Unfortunately this doesn't tell much. But since you say it does not happen when run under Valgrind, it might involve a race that leads to memory corruption, e.g. through use of freed memory.

Gaah...
Comment 13 nerdopolis1 2014-02-13 14:47:27 UTC
Created attachment 94003
bt full of a hung weston-desktop-shell

Does a bt full trace help?
Comment 14 U. Artie Eoff 2014-02-13 16:30:27 UTC
Would it be worth trying to see if this happens when using Mesa 10.0?  We are seeing another regression caused by Mesa master in bug 74689, which is why I ask.
Comment 15 nerdopolis1 2014-02-14 02:24:46 UTC
Hi.

Sorry about the delay.
I tested with Mesa 10.0, and it seems like I am NOT getting the hang.
Comment 16 U. Artie Eoff 2014-02-14 03:14:02 UTC
(In reply to comment #15)
> Hi.
> 
> Sorry about the delay.
> I tested with Mesa 10.0, and it seems like I am NOT getting the hang.

ok, perhaps this issue you're seeing is caused by the Mesa commit mentioned in bug 74689#c5, too.
Comment 17 U. Artie Eoff 2014-02-14 03:16:10 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > Hi.
> > 
> > Sorry about the delay.
> > I tested with Mesa 10.0, and it seems like I am NOT getting the hang.
> 
> ok, perhaps this issue you're seeing is caused by the Mesa commit mentioned
> in bug 74689#c5, too.

That is, this one: http://cgit.freedesktop.org/mesa/mesa/commit/?id=11baad35088dfd4bdabc1710df650
Comment 18 nerdopolis1 2014-02-14 12:50:48 UTC
I tried building Mesa master after running
git revert 11baad35088dfd4bdabc1710df650 -n
It still seems to hang.
Comment 19 nerdopolis1 2014-05-01 22:58:18 UTC
As it turns out, I had the ilo Gallium driver enabled in my Mesa build. That is what was causing the hang.
