Bug 105507

Summary: Crash when destroying a newly resized EGLsurface with wayland egl (dri2)
Product: Mesa Reporter: Johan Helsing <johan.helsing>
Component: EGL/WaylandAssignee: Wayland bug list <wayland-bugs>
Status: RESOLVED MOVED QA Contact: mesa-dev
Severity: normal    
Priority: medium CC: daniel, johan.helsing
Version: 17.3   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:

Description Johan Helsing 2018-03-14 14:54:23 UTC
In dri2_wl_surface_release_buffers, a wl_buffer is not destroyed if it's locked. Afterwards it's set to null regardless (dri2_surf->color_buffers[i].wl_buffer = NULL;)

Normally, this is fine, since the buffer will be released by the wl_buffer_release event when the compositor is done with it. But if the EGLSurface is destroyed first, then the event queue for the surface (and for the wl_buffer) is destroyed, and the wl_release event then causes a crash because we try to use a destroyed event queue.

One solution would be to maintain a separate list of buffers we tried to destroy, but couldn't because they were locked. And make sure they are destroyed in dri2_wl_destroy_surface.

This might not be a problem users frequently run into, but it's causing many unit tests in Qt to be flaky, and we probably have to blacklist them until this is fixed (https://bugreports.qt.io/browse/QTBUG-66848)
Comment 1 Emil Velikov 2018-03-14 17:55:38 UTC
Daniel, any suggestions?
Comment 2 Daniel Stone 2018-03-14 18:03:11 UTC
You could place any orphaned wl_buffers on an per-surface list instead, and spin at destruction until that emptied. I won't have the time to look into it myself for a while though.

Johan - which test hits this?
Comment 3 Johan Helsing 2018-03-15 08:18:36 UTC
I initially thought it was all the tests in tst_QOpenGLWindow, but it's only tst_QOpenGLWindow::basic. However, it will soon to be split into tst_QOpenGLWindow::resize if https://codereview.qt-project.org/#/c/223235/ goes in. The test is located in qtbase/tests/auto/gui/kernel/qopenglwindow

Note that when you run ./tst_QOpenGLWindow, the crash will be reported as a crash in one of the subsequent tests when the release event is received. The easiest way to trigger it would be to run the test like this:

$ ./tst_QOpenGLWindow basic basic basic basic basic basic basic basic basic basic basic basic

(to make sure the test client doesn't disconnect before the release event is received)
Comment 4 Pekka Paalanen 2018-03-15 08:27:41 UTC
(In reply to Daniel Stone from comment #2)
> You could place any orphaned wl_buffers on an per-surface list instead, and
> spin at destruction until that emptied.

What would guarantee that the last committed wl_buffer would actually get released at all?

Destroying the EGLSurface cannot destroy the wl_surface, because the wl_surface is caller-owned, isn't it?
Comment 5 Daniel Stone 2018-03-15 08:48:38 UTC
(In reply to Pekka Paalanen from comment #4)
> (In reply to Daniel Stone from comment #2)
> > You could place any orphaned wl_buffers on an per-surface list instead, and
> > spin at destruction until that emptied.
> 
> What would guarantee that the last committed wl_buffer would actually get
> released at all?
> 
> Destroying the EGLSurface cannot destroy the wl_surface, because the
> wl_surface is caller-owned, isn't it?

Correct, so the only correct (though incredibly counter-intuitive) course of action is to destroy the wl_surface before destroying the EGLSurface. Alternately, destroy the wl_surface role object in such a way which guarantees it's unmapped.

The fact people don't actually do this is why Mesa just immediately destroys _all_ buffers - released or not - when the EGLSurface is destroyed. If the buffer hasn't been released, you might catch a client error.

That being said, I was suggesting 'orphaned' wl_buffers: i.e. where they are still marked busy, but have been discarded from the surface list due to a resize request. This happens on the first draw call after a surface resize, so the only point which a buffer could be so orphaned and not have a release scheduled is if you did: wl_egl_window_resize(); glClear(); eglDestroySurface(); without swapping. Or, I guess, if you were frozen in a synchronous subsurface.

(I'm really regretting caving in and making it illegal to destroy an unreleased wl_buffer.)
Comment 6 Pekka Paalanen 2018-03-15 09:48:46 UTC
(In reply to Daniel Stone from comment #5)
> Correct, so the only correct (though incredibly counter-intuitive) course of
> action is to destroy the wl_surface before destroying the EGLSurface.
> Alternately, destroy the wl_surface role object in such a way which
> guarantees it's unmapped.

I would not like that. You create wl_surface, then wl_egl_surface, then EGLSurface, and destroy them in the exact opposite order. That's the only sane thing I can imagine. Otherwise various places will hold stale pointers, even if you could trust on your luck that those pointers will not be dereferenced when destroying the other bits.

> The fact people don't actually do this is why Mesa just immediately destroys
> _all_ buffers - released or not - when the EGLSurface is destroyed. If the
> buffer hasn't been released, you might catch a client error.

A client misbehaviour more like. It would be good to have the app unmap the window before the EGLSurface gets destroyed, but I don't think we can make that a hard requirement anymore. To me it seems perfectly logical that destroying the EGLSurface will release all framebuffer resources.

> That being said, I was suggesting 'orphaned' wl_buffers: i.e. where they are
> still marked busy, but have been discarded from the surface list due to a
> resize request. This happens on the first draw call after a surface resize,
> so the only point which a buffer could be so orphaned and not have a release
> scheduled is if you did: wl_egl_window_resize(); glClear();
> eglDestroySurface(); without swapping. Or, I guess, if you were frozen in a
> synchronous subsurface.

I don't think it's realistic to ban such sequences. The very least, they should not result in a crash.

> (I'm really regretting caving in and making it illegal to destroy an
> unreleased wl_buffer.)

Where was it made illegal? It may just cause a glitch on screen. If it's illegal, we should have compositors raise a fatal protocol error on destroy before release. Is that implemented somewhere?

I believe destroying the wl_buffers on EGLSurface destruction regardless whether they were released is the best solution. It allows the EGL to avoid crashes and stale pointers, it does not impose a very awkward and surprising destruction order, and apps still have the chance to guarantee there will not be any glitches by using a role-specific way to unmap the wl_surface before destroying the EGLSurface.
Comment 7 GitLab Migration User 2019-09-18 18:09:09 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/172.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.