Bug 98766 - We need fences support in Wayland compositors
Summary: We need fences support in Wayland compositors
Status: RESOLVED MOVED
Alias: None
Product: Wayland
Classification: Unclassified
Component: weston
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Wayland bug list
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 97380
Depends on:
Blocks:
 
Reported: 2016-11-17 23:00 UTC by Miguel A. Vico
Modified: 2018-06-08 23:55 UTC
CC List: 11 users

See Also:
i915 platform:
i915 features:



Description Miguel A. Vico 2016-11-17 23:00:11 UTC
Several vendors commit wl_surface state changes asynchronously to work around a limitation of Wayland compositors: they will use unfinished frames for composition.

Using unfinished frames causes the compositor to stall waiting for slow clients' rendering to finish, missing frames from faster clients or even slowing them down if they are synchronized to compositor redraws.

Using EGL_NV_stream_fifo_synchronous on the client side to defer wl_surface.{attach, damage, commit} until a frame is finished is how NVIDIA works around this limitation, but it goes against Wayland's atomicity assumptions for surface updates.

This problem should be fixed on the compositor side. Using fences, the compositor can query whether a client buffer has finished rendering and is ready for composition.
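As a purely illustrative sketch (not a proposal), assuming the client's fence arrived as a dma-fence FD, the compositor could poll it through EGL_KHR_fence_sync / EGL_ANDROID_native_fence_sync before picking the buffer up for composition:

/* Illustrative only: poll whether a client buffer's fence has signalled.
 * Assumes EGL_KHR_fence_sync + EGL_ANDROID_native_fence_sync, with the
 * extension entry points resolved via eglGetProcAddress() elsewhere. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <stdbool.h>
#include <unistd.h>

static bool buffer_ready(EGLDisplay dpy, int fence_fd)
{
    /* EGL takes ownership of the FD on success, so hand it a duplicate. */
    EGLint attribs[] = {
        EGL_SYNC_NATIVE_FENCE_FD_ANDROID, dup(fence_fd),
        EGL_NONE
    };
    EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID,
                                       attribs);
    if (sync == EGL_NO_SYNC_KHR)
        return false;

    /* Zero timeout: query the fence status without blocking. */
    EGLint status = eglClientWaitSyncKHR(dpy, sync, 0, 0);
    eglDestroySyncKHR(dpy, sync);
    return status == EGL_CONDITION_SATISFIED_KHR;
}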

There has been some discussion about this in https://bugs.freedesktop.org/show_bug.cgi?id=98731
Comment 1 Miguel A. Vico 2016-11-17 23:07:00 UTC
CC'ed some folks
Comment 2 Daniel Stone 2016-11-18 10:14:39 UTC
Thanks for opening this, Miguel! I've CCed some non-Weston compositor people as well.
Comment 3 Martin Flöser 2016-11-18 10:21:37 UTC
What kind of fences should the compositor use? Is there already an EGL extension for it?
Comment 4 James Jones 2016-11-18 19:50:25 UTC
Linux fence FDs are the obvious first implementation choice, and there are some vendor extensions to import/export them from EGLSync.

However, given the existence of Vulkan, EGLStreams, and non-Linux systems, it would be nice if the solution weren't too tied to the semantics of a specific type of fence object.  I'd like a Vulkan app to be able to send a cross-process Vulkan semaphore of some sort to a Vulkan-based compositor, for example, and preferably to be able to share it up-front and reference it later, rather than import/export it every frame.
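For illustration only, a minimal sketch of what sharing such a semaphore up-front could look like on the client side, assuming VK_KHR_external_semaphore_fd is enabled on the device (the semaphore is created and exported once, then referenced every frame rather than re-exported):

/* Illustrative only: create a Vulkan semaphore that can be exported once as
 * an opaque FD and then reused each frame. Assumes the device was created
 * with VK_KHR_external_semaphore_fd enabled. */
#include <vulkan/vulkan.h>

static int export_persistent_semaphore(VkDevice device, VkSemaphore *out_sem)
{
    VkExportSemaphoreCreateInfo export_info = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_SEMAPHORE_CREATE_INFO,
        .handleTypes = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    VkSemaphoreCreateInfo sem_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &export_info,
    };
    if (vkCreateSemaphore(device, &sem_info, NULL, out_sem) != VK_SUCCESS)
        return -1;

    PFN_vkGetSemaphoreFdKHR get_fd = (PFN_vkGetSemaphoreFdKHR)
        vkGetDeviceProcAddr(device, "vkGetSemaphoreFdKHR");
    VkSemaphoreGetFdInfoKHR fd_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = *out_sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    int fd = -1;
    if (get_fd(device, &fd_info, &fd) != VK_SUCCESS)
        return -1;
    return fd; /* sent to the compositor once; signalled/waited per frame */
}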
Comment 5 Daniel Stone 2018-06-04 06:53:39 UTC
*** Bug 97380 has been marked as a duplicate of this bug. ***
Comment 6 Daniel Stone 2018-06-04 06:57:53 UTC
I submitted a protocol implementing this, as well as support for Mutter and Mesa:
https://lists.freedesktop.org/archives/wayland-devel/2017-September/035080.html
Comment 7 James Jones 2018-06-04 16:11:26 UTC
The protocol:

https://lists.freedesktop.org/archives/wayland-devel/2017-September/035080.html

Doesn't seem to support the persistent object model that would be more performant when using something like Vulkan semaphores to perform the synchronization, and it requires dma-fence.  Isn't it preferable to write this in a way that doesn't require a specific type of primitive and doesn't require transmitting/setting up a fence FD on every frame?
Comment 8 Daniel Stone 2018-06-04 16:18:02 UTC
(In reply to James Jones from comment #7)
> The protocol:
> 
> https://lists.freedesktop.org/archives/wayland-devel/2017-September/035080.
> html
> 
> Doesn't seem to support the persistent object model that would be more
> performant if using something like Vulkan semaphores to perform the
> synchronization, and requires dma-fence.  Isn't preferable to write this in
> a way that doesn't require a specific type of primitive and doesn't require
> transmitting/setting up a fence FD on every frame?

We could rewrite it to use a persistent syncobj I suppose, but I don't understand the performance argument in all honesty. Is it just the overhead of importing and exporting dma-fences on your driver?

Ultimately we need a dma-fence to work with KMS anyway, so Weston at least would probably just turn straight around and export the syncobj to a fence if the client was being displayed in a plane on the display controller. Unless I've missed something (please point me to it if so?) we also don't have an extension for EGL to ingest syncobjs, so we'd need export for that anyway.

If the performance argument is just about creating Wayland objects, I'm intensely relaxed about that. Creating objects does not require a roundtrip, and we already have a lot of throwaway objects, e.g. in the dmabuf extension where we create a new object just to take the buffer parameters and discard it as soon as the buffer is created.
Comment 9 James Jones 2018-06-04 16:27:03 UTC
Yes, the overhead is in the ioctls needed to instantiate a sync object into a usermode driver or another command queue in our kernel driver.  It's relatively expensive on NV hardware when not using a global GPU virtual address space.

There's an API to import general Vulkan sync primitives (including dma-fences/sync FDs, but also persistent-style Vulkan semaphores) directly into GL via the GL/Vulkan interop extensions, which I believe Mesa supports.
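As a rough illustration of that interop path (assuming GL_EXT_semaphore / GL_EXT_semaphore_fd are exposed and that the exported Vulkan semaphore FD has already been received; entry points are resolved via eglGetProcAddress in practice):

/* Illustrative only: import an exported Vulkan semaphore into GL/GLES and
 * make texture accesses wait on it, per GL_EXT_semaphore(_fd). */
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

static GLuint import_vk_semaphore(int fd)
{
    GLuint sem;
    glGenSemaphoresEXT(1, &sem);
    /* GL takes ownership of the file descriptor. */
    glImportSemaphoreFdEXT(sem, GL_HANDLE_TYPE_OPAQUE_FD_EXT, fd);
    return sem;
}

static void wait_before_sampling(GLuint sem, GLuint texture)
{
    GLenum layout = GL_LAYOUT_SHADER_READ_ONLY_EXT;
    /* Subsequent GL commands sourcing from 'texture' wait for the Vulkan
     * signal operation on the shared semaphore. */
    glWaitSemaphoreEXT(sem, 0, NULL, 1, &texture, &layout);
}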

KMS may need sync FDs, but not all compositing happens in KMS.  There's also the hypothetical-at-this-point Vulkan-only compositor implementation that doesn't use KMS.  Having Weston/Mesa/Mutter/etc. support only KMS and even only sync FDs/dma-fences is the current design norm, but baking that assumption into the protocol is another thing.  The core Wayland protocols aimed for platform independence, and this seems like something that's likely to become a relatively central concept in Wayland frame delivery in the future.
Comment 10 Jason Ekstrand 2018-06-04 17:13:13 UTC
(In reply to James Jones from comment #9)
> Yes, the overhead is in the ioctls needed to instantiate a sync object into
> a usermode driver or another command queue in our kernel driver.  It's
> relatively expensive on NV hardware when not using a global GPU virtual
> address space.

I agree that there are some ways in which the sync file ioctls could be more efficient.  Our driver does N-1 sync-file merge ioctls on each submit, which isn't great, especially if the compositor has dozens of clients each handing it sync files.  If we had a multi-merge ioctl, maybe a bunch of this overhead would go away.
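For reference, the pairwise merging in question with today's UAPI: combining N sync files takes N-1 SYNC_IOC_MERGE ioctls (a rough sketch, assuming <linux/sync_file.h> from the kernel headers; the function name is mine):

/* Illustrative only: merge N sync-file FDs into one via N-1 SYNC_IOC_MERGE
 * ioctls; assumes n >= 1 and that the caller keeps ownership of fds[]. */
#include <linux/sync_file.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int merge_sync_files(const int *fds, int n)
{
    int acc = dup(fds[0]);
    for (int i = 1; i < n && acc >= 0; i++) {
        struct sync_merge_data data = { .fd2 = fds[i] };
        strncpy(data.name, "submit-merge", sizeof(data.name) - 1);
        if (ioctl(acc, SYNC_IOC_MERGE, &data) < 0) {
            close(acc);
            return -1;
        }
        close(acc);
        acc = data.fence; /* merged sync file replaces the accumulator */
    }
    return acc;
}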

> There's an API to import general Vulkan sync primitives (Including dma
> fences/sync FDs, but also persistent-style Vulkan semaphores) directly to GL
> via the GL/Vulkan interop extensions which I believe Mesa supports.

Some Mesa drivers support it, but not all.

In general, we seem to be getting back to the same discussion as EGLStreams about whether we build Wayland on top of Khronos APIs or Linux APIs.  Even if both Vulkan and GL support said extension, what about KMS?  What about some video encode API?  The advantage of sync_file is that it's supported everywhere and allows multiple components from multiple vendors to synchronize with each other.  Fun fact: at one point in time, there was a KMS-only (no 3D API involved) Weston back-end for the Raspberry Pi.

> KMS may need sync FDs, but not all compositing happens in KMS.  There's also
> the hypothetical-at-this-point Vulkan-only compositor implementation that
> doesn't use KMS.  Having Weston/Mesa/Mutter/etc. support only KMS and even
> only sync FDs/dma-fences is the current design norm, but baking that
> assumption in the protocol is another thing.

We have to bake some set of assumptions in.  We can bake in Linux/Android assumptions around sync_file or we can bake in Khronos API assumptions around GL, Vulkan, EGL, etc.  The question isn't if we bake in assumptions, it's which assumptions we bake in.  When we already have fairly well standardized APIs for modesetting and cross-component synchronization, why is the answer to keep wrapping them in more and more layers of Khronos APIs until EGL rules the world?

By the way, I'm not arguing that you don't have a real problem which really needs solving.  I'm well aware of the problem and it badly needs solving!  I just question the solution of adding more EGL (or Vulkan WSI).
Comment 11 James Jones 2018-06-04 17:46:48 UTC
I'm not proposing to bake in any particular backend assumptions or to debate whether Khronos or Linux assumptions are better; rather, I was indirectly suggesting a few modifications to make the overall extension less dependent on any particular type, and to allow persistence:

- Register fences as a new Wayland object type prior to actual attachment, similar to how wl_surface abstracts DRM/shm/etc.  This makes persistence easy when the underlying objects support it, and since apparently Wayland object creation is cheap, it shouldn't add appreciable overhead where persistence doesn't matter or doesn't work.

- Explicitly name the type of the FD being registered.  If dma-fence/sync FD is the only type available in rev1, that's fine, but it's an easily extensible API.
Comment 12 Daniel Stone 2018-06-04 19:04:18 UTC
(In reply to James Jones from comment #9, and also comment #11)
> Yes, the overhead is in the ioctls needed to instantiate a sync object into
> a usermode driver or another command queue in our kernel driver.  It's
> relatively expensive on NV hardware when not using a global GPU virtual
> address space.
> 
> There's an API to import general Vulkan sync primitives (Including dma
> fences/sync FDs, but also persistent-style Vulkan semaphores) directly to GL
> via the GL/Vulkan interop extensions which I believe Mesa supports.

Sure, but the overhead of importing a semaphore into a Vulkan context, exporting that as an opaque FD, importing that into GL and then using it, seems higher than just importing a fence? Those extensions are also only synced for big GL: there's a question at the end concluding that there's no use for EGL/ES, since you can already import dma-fence FDs into EGLSync objects.

In a pure-Vulkan world, I can see the use for the equivalent of in-fences (from the compositor's PoV, what would be the signal semaphore of vkQueuePresentKHR's operation) to associate with wl_surface_attach using sync objects; it seems pretty obvious that the client would create it early (unsignaled), queue one signaling operation, and then the compositor would queue at least one wait operation.

But for the equivalent of out-fences (semaphore parameter to vkAcquireNextImageKHR, associated with wl_buffer_release), not so much? The compositor may queue at least one operation which would be the semaphore for the image inside vkAcquireNextImage. This might be an EGLSync/dma-fence signaling after the last EGL/GLES operation sourcing from the buffer, or a dma-fence from KMS, or a VkSemaphore from Vulkan composition. It looks like we'd need new API for the kernel, which would signal a syncobj when a particular collection of dma-fences had all signaled.
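For orientation (not part of the proposed protocol), a minimal sketch of the existing WSI calls being referred to; the acquire semaphore plays the "out-fence" role and the present wait semaphores the "in-fence" role, with all handles assumed to be created elsewhere:

/* Illustrative only. */
#include <stdint.h>
#include <vulkan/vulkan.h>

static void present_one_frame(VkDevice device, VkQueue queue,
                              VkSwapchainKHR swapchain,
                              VkSemaphore acquire_sem,     /* "out-fence" */
                              VkSemaphore render_done_sem) /* "in-fence"  */
{
    uint32_t image_index;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          acquire_sem, VK_NULL_HANDLE, &image_index);

    /* ...submit rendering that waits on acquire_sem and signals
     * render_done_sem... */

    VkPresentInfoKHR present = {
        .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &render_done_sem,
        .swapchainCount = 1,
        .pSwapchains = &swapchain,
        .pImageIndices = &image_index,
    };
    vkQueuePresentKHR(queue, &present);
}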

The compositor can't just take the client's wl_vk_semaphore object and instruct all operations using the buffer to signal the semaphore: there may be more than one (e.g. scanout + media encode), and the compositor doesn't necessarily know when queuing those operations that they will be the last operations queued on that buffer either. I assume that API of delayed fence -> syncobj signal chaining would still be prohibitively expensive for you though?

> -Explicitly name the type of the FD being registered.  If dma-fence/sync FD
> is the only type available in rev1, that's fine, but it's an easily
> extensible API.

It sounds like the protocol would need to encode a preference: 'I can take dma-fences but it's going to reduce your framerate' vs. 'I can take syncobjs but I'm just going to export them straight to fences'. And I have no idea which side should 'win' if the compositor prefers one synchronisation model and the client prefers another.

> KMS may need sync FDs, but not all compositing happens in KMS.  There's also
> the hypothetical-at-this-point Vulkan-only compositor implementation that
> doesn't use KMS. 

To be honest, it's not really clear to me at the moment what a Vulkan composition model would look like, so it's hard to say. For example, as far as I can tell there's no way to directly present VkImages with VK_KHR_display, only the result of the compositor's own Vulkan rendering, which means we can't usefully use overlay planes in a compositor yet.

Also, without either a totally omniscient planner ('here's how to recognise your scene graph and I guarantee this will work'), or the brute-force model we use with atomic KMS where we just throw potential configurations at the wall in a test-only mode until it finally sticks, we can't really hoist things into overlays. There's also the need for atomic modesetting, hotplug notifications, and so forth.

Maybe standing up a demo of how a VK_KHR_display-based compositor could look would be enlightening.
Comment 13 James Jones 2018-06-04 19:42:44 UTC
> Sure, but the overhead of importing a semaphore into a Vulkan context, 
> exporting that as an opaque FD, importing that into GL and then using it, 
> seems higher than just importing a fence? Those extensions are also only 
> synced for big GL: there's a question at the end concluding that there's no 
> use for EGL/ES, since you can already import dma-fence FDs into EGLSync 
> objects.

I'm not sure where this import/export/import path comes from?  And no, these are not BigGL only.  Can you point to the question you think implies this?  Look at the extension's dependencies section, and the notes in the "functions added" sections describing which apply to OpenGL vs. OpenGL ES only.  They're explicitly designed for both GLES and GL.

> The compositor can't just take the client's wl_vk_semaphore object and 
> instruct all operations using the buffer to signal the semaphore: there may be 
> more than one (e.g. scanout + media encode), and the compositor doesn't 
> necessarily know when queuing those operations that they will be the last 
> operations queued on that buffer either. I assume that API of delayed fence -> 
> syncobj signal chaining would still be prohibitively expensive for you though?

I think this is being over-thought.  For the out-fence, you've specified an event that provides an FD to the client.  Instead, you'd probably have a potentially "empty" wayland sync object thing (Let's please not try to make the naming slant the debate) pre-associated with the surface at attach time, and you could send an event when it was safe for the client to wait for it when releasing as-is (implying it was non-empty when attached), or optionally include an update to the object that associates a new FD with it (For the case where it was previously empty, or just needs some new primitive to back it for whatever reason).

> It sounds like the protocol would need to encode a preference: 'I can take 
> dma-fences but it's going to reduce your framerate' vs. 'I can take syncobjs 
> but I'm just going to export them straight to fences'. And I have no idea 
> which side should 'win' if the compositor prefers one synchronisation model 
> and the client prefers another.

I hadn't imagined the client/server would claim support for less-than-ideal primitives.  Similar to DRM-based buffers, you support sync-fd or you don't, both as a compositor and a client.  Presumably this would only really be useful with hardware-accelerated clients, and just like wl_drm stuff, you just keep the client API binding (EGL, Vulkan WSI, etc.) and compositor in sync.  Apps that don't reach down to do low-level buffer pushing themselves would be oblivious, and the handful that hand-code this would support what they support.  If sync fd gets embedded in enough hand-coded apps, it becomes a de-facto standard for some market.  This hasn't actually been a problem with DRM buffers thus far.  I'm just suggesting emulation of roughly that model for synchronization as well.

> Also, without either a totally omniscient planner ('here's how to recognise 
> your scene graph and I guarantee this will work'), or the brute-force model we 
> use with atomic KMS where we just throw potential configurations at the wall 
> in a test-only mode until it finally sticks, we can't really hoist things into 
> overlays. There's also the need for atomic modesetting, hotplug notifications, 
> and so forth.

I'm really trying to avoid debating the merits of a Vulkan compositor (It was just an example), but Vulkan direct-to-display has atomic modesetting (you can atomically present multiple swapchains, share images between swapchains), and display hotplug notification.

It would not be hard to write an extension that lets direct-to-display swapchain images be shareable or client-allocated.  It's only hard to share swapchain images in the general swapchain case, not for specific swapchain backends/VkSurface types.  I may write this at some point anyway for other reasons, but I've been waiting for an urgent use case to justify it.
Comment 14 Tomek Bury 2018-06-05 11:17:14 UTC
Hi Daniel,

I was wondering how (if at all?) you are going to implement the new buffer release mechanism in the Weston GL renderer.

Where do you get the dma_fence object from? If I understand correctly, the dma_fence has to be signalled by the GL driver after the last glDraw involving the wl_buffer has finished executing on the GPU.

I quickly checked the GL and EGL extensions, but I can't see anything that would export a dma_fence object the compositor could return with the wl_buffer.
Comment 15 Daniel Stone 2018-06-05 16:33:46 UTC
(In reply to James Jones from comment #13)
> > Sure, but the overhead of importing a semaphore into a Vulkan context, 
> > exporting that as an opaque FD, importing that into GL and then using it, 
> > seems higher than just importing a fence? Those extensions are also only 
> > synced for big GL: there's a question at the end concluding that there's no 
> > use for EGL/ES, since you can already import dma-fence FDs into EGLSync 
> > objects.
> 
> I'm not sure where this import/export/import path comes from?  And no, these
> are not BigGL only.  Can you point to the question you think implies this? 
> Look at the extension's dependencies section, and the notes in the
> "functions added" sections describing which apply to OpenGL Vs OpenGL ES
> only.  They're explicitly designed for both GLES and GL.

My reading of support for semaphore usage in GL_EXT_external_objects is that it requires GL_EXT_semaphore to be present; the spec asserts that 'GL_EXT_semaphore requires OpenGL 1.0' with (unlike the others) no mention of GLES.

The 'question at the end' bit came from GL_EXT_external_objects_fd, which notes that only opaque objects are supported for import, and that dma-fence FDs are better handled through EGLSync. That implies that we can only exchange semaphores when the client and compositor are running on the same device and version-locked, which isn't really the direction I was hoping to go in.

I don't think 'import/export/import' actually applies; please disregard that.

> > The compositor can't just take the client's wl_vk_semaphore object and 
> > instruct all operations using the buffer to signal the semaphore: there may be 
> > more than one (e.g. scanout + media encode), and the compositor doesn't 
> > necessarily know when queuing those operations that they will be the last 
> > operations queued on that buffer either. I assume that API of delayed fence -> 
> > syncobj signal chaining would still be prohibitively expensive for you though?
> 
> I think this is being over-thought.  For the out-fence, you've specified an
> event that provides an FD to the client.  Instead, you'd probably have a
> potentially "empty" wayland sync object thing (Let's please not try to make
> the naming slant the debate) pre-associated with the surface at attach time,
> and you could send an event when it was safe for the client to wait for it
> when releasing as-is (implying it was non-empty when attached), or
> optionally include an update to the object that associates a new FD with it
> (For the case where it was previously empty, or just needs some new
> primitive to back it for whatever reason).

Right, I'm fine with a set of pre-created 'sync object things'. I'm just trying to figure out if the act of making the sync-object-thing (specifically created by client & compositor co-operating) be signalled by another sync-object thing (maybe a fence, maybe a VkSemaphore, maybe something else) which was created by the compositor, would have the same overhead you were talking about.

IOW, if passing back a dma-fence from the compositor's blit operation or the KMS scanout operation is too high overhead, I'm not seeing why chaining that fence into a client semaphore would be less overhead? It seems like they're semantically the same from a low level; the only difference is the use of an intermediate Wayland protocol object which should have no influence on the lower levels of the display stack.

> > It sounds like the protocol would need to encode a preference: 'I can take 
> > dma-fences but it's going to reduce your framerate' vs. 'I can take syncobjs 
> > but I'm just going to export them straight to fences'. And I have no idea 
> > which side should 'win' if the compositor prefers one synchronisation model 
> > and the client prefers another.
> 
> I hadn't imagined the client/server would claim support for less-than-ideal
> primitives.  Similar to DRM-based buffers, you support sync-fd or you don't,
> both as a compositor and a client.  Presumably this would only really be
> useful with hardware-accelerated clients, and just like wl_drm stuff, you
> just keep the client API binding (EGL, Vulkan WSI, etc.) and compositor in
> sync.  Apps that don't reach down to do low-level buffer pushing themselves
> would be oblivious, and the handful that hand-code this would support what
> they support.  If sync fd gets embedded in enough hand-coded apps, it
> becomes a de-facto standard for some market.  This hasn't actually been a
> problem with DRM buffers thus far.  I'm just suggesting emulation of roughly
> that model for synchronization as well.

I guess it's more of an interop thing. Right now we have dmabuf which works as well as it can within the constraints, e.g. linear only when sharing between different vendors. One of the explicit goals of the dmabuf extension, and of this one as well, was to extend that same sharing model to fences. Rather than just plumbing opaque FDs between version-locked drivers for the same device, it aimed to extend functionality to multiple disjoint GPUs, display controllers, media encode/decode engines, and so on.

That being said, I don't see a problem with specifically adding a type enum for anything which might come up in the future.

(In reply to Tomek Bury from comment #14)
> I was wandering how (if?) are you going to implement the new buffer release
> mechanism in Weston GL renderer.
> 
> Where do you get the dma_fence object from? If I understand correctly, the
> dma_fence has to be signalled by the GL driver after the last glDraw
> involving wl_buffer has finished executing on the GPU. 
> 
> I checked quickly GL and EGL extensions but I can't see anything that would
> export dma_fence object the compositor could return with the wl_buffer.

We can create an EGLSync object, request that it be signalled when the most recently submitted command has fully retired, and then use the Android native_fence_fd extension to get a dma-fence from that. The compositor then keeps around the fence used for its last GL composition per output (and for the last completed KMS scene), and we can pass whichever is applicable back to the client as a release fence. (Or embed it within another kind of sync object.)
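A minimal sketch of that path, assuming EGL_KHR_fence_sync + EGL_ANDROID_native_fence_sync with the extension entry points resolved via eglGetProcAddress (the helper name is mine):

/* Illustrative only: after the compositor's last GL command sourcing from the
 * client buffer, create a native-fence sync and extract a dma-fence FD that
 * can be handed back to the client as a release fence. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

static int export_release_fence(EGLDisplay dpy)
{
    EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID,
                                       NULL);
    if (sync == EGL_NO_SYNC_KHR)
        return -1;

    /* The native fence is created once the sync is flushed to the GPU. */
    glFlush();

    int fd = eglDupNativeFenceFDANDROID(dpy, sync);
    eglDestroySyncKHR(dpy, sync);
    return fd; /* EGL_NO_NATIVE_FENCE_FD_ANDROID (-1) on failure */
}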
Comment 16 Tomek Bury 2018-06-06 14:46:18 UTC
> We can create an EGLSync object, request that it be signalled when the most
> recently submitted command has fully retired, and then use the Android
> native_fence_fd extension to get a dma-fence from that.

That's not guaranteed to work. It's hard to tell what type of file descriptor you'll get from VK or EGL. It can be either an old-style Android fence FD or a new-style sync file FD.

Vulkan 1.1 says this:
"Handles of type VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT generated by the
implementation may represent either Linux Sync Files or Android Fences at the
implementation’s discretion. Applications should only use operations defined for
both types of file descriptors, unless they know via means external to Vulkan the
type of the file descriptor, or are prepared to deal with the system-defined
operation failures resulting from using the wrong type."

The EGL_ANDROID_native_fence_sync extension is even more vague and only says that you'll get a "file descriptor that refers to a native fence object". It means different things on different Android versions but this is Wayland, not Android.
Comment 17 GitLab Migration User 2018-06-08 23:55:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/wayland/weston/issues/85.

