Bug 105695 - [PERF] Updating ubo offset via vkCmdBindDescriptorSets is causing flush that is taking 50% of rendering time
Status: RESOLVED NOTABUG
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
Reported: 2018-03-22 15:51 UTC by Vyacheslav
Modified: 2018-03-29 14:55 UTC

Description Vyacheslav 2018-03-22 15:51:30 UTC
I do vkCmdBindDescriptorSets per draw call to change transformation matrix of an object. Basically I change only one number - offset into dynamic ubo. Then I call vkCmdDrawIndexed and it calls gen9_cmd_buffer_flush_state that amounts to 50% of rendering time (in terms of instr fetch metric). I dig deeper and find two culprits: flush_descriptor_sets, cmd_buffer_flush_push_constants. And I don't even use push constants, I prefer dynamic ubos. Emitting binding tables is huge amount of work (23% of total rendering time). I don't understand why so much work has to be done just to change offset in memory. And this is the most popular usecase - everyone wants to change matrix per object. My opengl implementation is 2 times faster than this. Are there any plans on improving performance in that area?

I got this data from valgrind profiling: https://www.dropbox.com/s/wsri01x69kwciyo/callgrind.cullingvk_ubo.out?dl=0
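
For readers unfamiliar with the pattern described above, here is a minimal sketch of how the per-object dynamic offset is usually computed. It assumes every object's data lives in one large uniform buffer and that each slot is padded to the device's `minUniformBufferOffsetAlignment` limit. The helper names `align_up` and `dynamic_ubo_offset` are hypothetical and do not come from the reporter's code.

```c
#include <assert.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `alignment`.
 * `alignment` must be a non-zero power of two, which holds for
 * VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment. */
static uint32_t align_up(uint32_t size, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of object `index` inside one large dynamic uniform buffer.
 * The result is the kind of value passed in pDynamicOffsets to
 * vkCmdBindDescriptorSets before each draw. */
static uint32_t dynamic_ubo_offset(uint32_t index,
                                   uint32_t per_object_size,
                                   uint32_t min_alignment)
{
    return index * align_up(per_object_size, min_alignment);
}
```

With a 64-byte matrix and a 256-byte alignment limit, object 3 would sit at byte 768, so only that single number changes between draws.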
Comment 1 Jason Ekstrand 2018-03-24 00:35:35 UTC
(In reply to Vyacheslav from comment #0)
> I do vkCmdBindDescriptorSets per draw call to change transformation matrix
> of an object. Basically I change only one number - offset into dynamic ubo.
> Then I call vkCmdDrawIndexed and it calls gen9_cmd_buffer_flush_state that
> amounts to 50% of rendering time (in terms of instr fetch metric). I dig
> deeper and find two culprits: flush_descriptor_sets,
> cmd_buffer_flush_push_constants.

From your callgrind profile, it appears that you have a build of the driver with asserts enabled.  Please try with a fully optimized driver build and see how the performance looks.  We have a *lot* of asserts in our driver (every field of every hardware packet has bounds-checking, for instance) and enabling them will kill performance.

> And I don't even use push constants, I
> prefer dynamic ubos. Emitting binding tables is huge amount of work (23% of
> total rendering time). I don't understand why so much work has to be done
> just to change offset in memory.

As far as push constants go, you are getting them whether you meant to or not.  We push chunks of UBOs when we can and it significantly helps UBO performance in most cases.

> And this is the most popular usecase - everyone wants to change matrix per
> object.

Yes and no.  If you really need to be changing matrices that often, there are other mechanisms that are more efficient.  Frequently engines will do a large draw (many vertices) with a single UBO containing an array of matrices and index into the array using something they pass in as a vertex attribute.  Doing thousands of back-to-back draws with descriptor set re-binds in between is basically a CPU overhead micro-benchmark.
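
The batching approach described above can be modelled on the CPU side roughly as follows. This is only a sketch under stated assumptions: `Mat4`, `Batch`, `MAX_OBJECTS` and the `batch_*` helpers are hypothetical names, the matrix array stands in for the contents of the single UBO, and `instance_index` stands in for the per-instance vertex attribute. No Vulkan calls are made.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 256  /* hypothetical capacity of the matrix array */

typedef struct { float m[16]; } Mat4;

typedef struct {
    Mat4     matrices[MAX_OBJECTS];       /* would be uploaded once as one UBO */
    unsigned instance_index[MAX_OBJECTS]; /* would be a per-instance attribute */
    size_t   count;
} Batch;

static void batch_init(Batch *b)
{
    b->count = 0;
}

/* Append one object's matrix; returns its slot, or -1 if the batch is full. */
static int batch_add(Batch *b, const Mat4 *m)
{
    if (b->count >= MAX_OBJECTS)
        return -1;
    b->matrices[b->count] = *m;
    b->instance_index[b->count] = (unsigned)b->count;
    return (int)b->count++;
}

/* Mimics the lookup a vertex shader would do: read the index attribute,
 * then fetch the matrix from the array. */
static const Mat4 *batch_fetch(const Batch *b, size_t instance)
{
    assert(instance < b->count);
    return &b->matrices[b->instance_index[instance]];
}
```

The point of the layout is that all objects in the batch can be submitted with one descriptor set bind and one draw, instead of one bind and one draw per object.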

> My opengl implementation is 2
> times faster than this. Are there any plans on improving performance in that
> area?

That definitely shouldn't be the case. :-)  I suspect this would change if you ran with a properly optimized build.
Comment 2 Vyacheslav 2018-03-28 13:55:58 UTC
You are right, I get better perf without asserts. It is still lagging behind opengl. I also checked proprietary nvidia driver and I also get similar results with vulkan slightly lagging behind. I think the issue can be closed but I'm still left with impression that opengl drivers are better at handling memory management in this case (passing 50000 matrices per frame via glUniformMatrix). Probably because opengl drivers are more mature and better at tracking dirty state.
Comment 3 Jason Ekstrand 2018-03-28 17:19:28 UTC
(In reply to Vyacheslav from comment #2)
> You are right, I get better perf without asserts. It is still lagging behind
> opengl. I also checked proprietary nvidia driver and I also get similar
> results with vulkan slightly lagging behind.

How big is the discrepancy?  I'm a bit surprised if there is much but I could see it happening.  If you're CPU bound, what is the CPU overhead of your app (not the driver) in the Vulkan vs. GL configuration?

> I think the issue can be closed
> but I'm still left with impression that opengl drivers are better at
> handling memory management in this case (passing 50000 matrices per frame
> via glUniformMatrix). Probably because opengl drivers are more mature and
> better at tracking dirty state.

More to the point, I think OpenGL drivers are highly optimized for repeatedly changing a uniform and drawing as that was the best practice 5-10 years ago.  Modern applications tend to try very hard to reduce the number of draw calls and state changes.  Even if the CPU overhead of changing the matrix 50k times is low, it's likely to cause quite a bit of GPU overhead having all those state changes and the stalling that likely comes with them.
Comment 4 Vyacheslav 2018-03-29 14:55:51 UTC
I found the problem with my code - I did vkQueueWaitIdle() after each frame and it was taking up 80% of rendering time (yes, I'm still learning vulkan). I changed it to waiting on fences that are passed to vkQueueSubmit(). I have triple buffering, so I have three images in the swap chain - 0,1,2. I record command buffers, submit and present 0,1,2 without waiting, then on frame 3 I wait on the fence for completion of rendering of frame 0. On frame 4 I wait for completion of frame 1 and so on. I know I can do more than 3 but that alone gave me a big performance increase. Now in vulkan I get 28ms/frame, in opengl 58ms/frame. And on vulkan cpu load is <70%, in opengl cpu load is constantly >95%. So lesson learned - proper synchronization is important.
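
The fence scheme described above comes down to simple frame bookkeeping, sketched below. It assumes three frames in flight, as in the comment. `FRAMES_IN_FLIGHT`, `fence_slot` and `frame_to_wait_for` are hypothetical names, and the sketch models only which fence is waited on, not the Vulkan calls themselves.

```c
#include <assert.h>

#define FRAMES_IN_FLIGHT 3  /* assumption: one fence per swap chain image */

/* Index of the fence that frame `frame` passes to vkQueueSubmit(). */
static unsigned fence_slot(unsigned frame)
{
    return frame % FRAMES_IN_FLIGHT;
}

/* Frame whose fence must be signalled before frame `frame` may reuse its
 * slot, or -1 while the first FRAMES_IN_FLIGHT frames are submitted
 * without waiting. In a real loop this is where vkWaitForFences() and
 * vkResetFences() would be called on the fence in fence_slot(frame). */
static long frame_to_wait_for(unsigned frame)
{
    return frame < FRAMES_IN_FLIGHT ? -1 : (long)frame - FRAMES_IN_FLIGHT;
}
```

Frames 0, 1 and 2 return -1 (no wait), frame 3 waits on frame 0 and frame 4 waits on frame 1, matching the sequence in the comment. Unlike vkQueueWaitIdle(), the CPU only blocks when it gets a full three frames ahead of the GPU.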

