Bug 40401

Summary: Cogs is slow
Product: Mesa Reporter: Pavel Ondračka <pavel.ondracka>
Component: Mesa coreAssignee: mesa-dev
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: niels_ole, ptpzz, sa, thejoe
Version: gitKeywords: patch
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Pavel Ondračka 2011-08-26 11:07:54 UTC
The game Cogs is very slow with the r300g (and r600g) driver. This was originally mentioned in bug 39572, however that bug was about different problem and is already closed. There are two experimental patches for r300g and r600g in that bug, so I'm opening a new bug to keep track of them. Setting component to mesa core because the problem seems to be in u_vbuf_manager.

Patches from Vadim Girlin:
https://bugs.freedesktop.org/attachment.cgi?id=49707
https://bugs.freedesktop.org/attachment.cgi?id=49735

Those patches makes Cogs much faster.
Comment 1 Vadim Girlin 2011-08-26 17:35:22 UTC
(In reply to comment #0)
> The game Cogs is very slow with the r300g (and r600g) driver. This was
> originally mentioned in bug 39572, however that bug was about different problem
> and is already closed. There are two experimental patches for r300g and r600g
> in that bug, so I'm opening a new bug to keep track of them. Setting component
> to mesa core because the problem seems to be in u_vbuf_manager.
> 

I think it's not a mesa problem, it's a problem of the game which uses opengl
in the way which is clearly described in the gl spec as the way to get reduced
performance. This patch is a workaround for the game problem and it was written
just to check that I located the problem correctly. All other (properly
optimized) apps won't benefit from this patch, probably they'll become a bit
slower due to additional overhead with this workaround. That's why I'm not sure
that I want to submit such patches to mesa.

The best solution is to fix the game itself, so I think it makes sense to ask
the game developers about it (or to open the sources).
Comment 2 Pavel Ondračka 2011-08-26 23:17:40 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > The game Cogs is very slow with the r300g (and r600g) driver. This was
> > originally mentioned in bug 39572, however that bug was about different problem
> > and is already closed. There are two experimental patches for r300g and r600g
> > in that bug, so I'm opening a new bug to keep track of them. Setting component
> > to mesa core because the problem seems to be in u_vbuf_manager.
> > 
> 
> I think it's not a mesa problem, it's a problem of the game which uses opengl
> in the way which is clearly described in the gl spec as the way to get reduced
> performance. This patch is a workaround for the game problem and it was written
> just to check that I located the problem correctly. All other (properly
> optimized) apps won't benefit from this patch, probably they'll become a bit
> slower due to additional overhead with this workaround. That's why I'm not sure
> that I want to submit such patches to mesa.
> 
> The best solution is to fix the game itself, so I think it makes sense to ask
> the game developers about it (or to open the sources).

OK, I've posted about this at Cogs forums:
http://forum.lazy8studios.com/index.php?topic=532.0
we shall see what can Cogs developers say about this.
Comment 3 Pavel Ondračka 2011-08-26 23:42:23 UTC
According to Cogs developer: "The Cogs vertex data is already very heavily optimized into triangle strips using nVidia's algorithms to greatly reduce the amount of data that needs to be sent to the GPU.  If modifying a driver can result in a 10x performance improvement, the problem is the original driver."

I don't know what to think about this.
Comment 4 Md Imam Hossain 2011-08-27 00:04:45 UTC
Same here on Intel GMA 4500 and Mesa 7.12-devel
Comment 5 Vadim Girlin 2011-08-27 05:40:53 UTC
(In reply to comment #3)
> According to Cogs developer: "The Cogs vertex data is already very heavily
> optimized into triangle strips using nVidia's algorithms to greatly reduce the
> amount of data that needs to be sent to the GPU.  If modifying a driver can
> result in a 10x performance improvement, the problem is the original driver."
> 
> I don't know what to think about this.

When I was fixing original bug 39572, I downloaded and tried the demo on their site first (with wine). Running build of this game for another OS through additional layer (wine) results in excellent performance with the *original* driver. If without modifying the driver we can get a 10x performance improvement, I think the problem is not in the driver. 

In hope that the game developers are reading this: IIRR, when running through wine, vertex data is in the gl buffer objects, instead of client arrays, so that it's not needed to upload it with every draw call (and it's not needed to filter this data in the driver as these patches do). I'm not sure if it's wine's optimizations, or the game itself does it this way through d3d. Why not to do the same with opengl in the native linux build? This will make the game fast for *all* drivers, without the need for any patches or workarounds. IIRR the draw call which is so slow in the native linux build comes from Mesh::draw().
Comment 6 Marek Olšák 2011-08-27 06:10:02 UTC
We usually just try to optimize drivers even for a totally inefficient OpenGL usage and get the most performance out of it no matter what.

Some comments regarding the patches:
1) Can u_vbuf_mgr_draw_optimize be called inside u_vbuf_mgr_draw_begin instead of explicitly by a driver?
2) Can you directly write the new vertices into a buffer object instead of user memory, which must be uploaded to a buffer object at a later point anyway?
3) Could you please comment the code a bit more, e.g. how it interacts with the other parts of u_vbuf_mgr.
4) We could use the translate module to convert indexed vertices in several separate buffers into one non-indexed interleaved buffer. I think nv50 and nvc0 always do that for user buffers.

I think it's an optimization we definitely want.
Comment 7 Vadim Girlin 2011-08-27 09:44:03 UTC
(In reply to comment #6)
> We usually just try to optimize drivers even for a totally inefficient OpenGL
> usage and get the most performance out of it no matter what.
> 

OK, I understand that mesa would benefit from such optimization, but taking into consideration the time which I'll need to spend on analyzing different ways of doing it, implementing and testing, and potential benefits for applications, it wasn't a top priority for me.

These patches were intended for doing a quick test, handling just this single case. But if we want more universal solution, then I'm not sure that vbuf_mgr is a best place for this optimization - maybe it makes sense to do it somewhere else in the mesa, not on the driver level?

> Some comments regarding the patches:
> 1) Can u_vbuf_mgr_draw_optimize be called inside u_vbuf_mgr_draw_begin instead
> of explicitly by a driver?
IIRR I called it explicitly from the r600g because I wasn't sure initially if it's needed for (or will work with) r300g. It seems this call can be moved into the u_vbuf_mgr_draw_begin.

> 2) Can you directly write the new vertices into a buffer object instead of user
> memory, which must be uploaded to a buffer object at a later point anyway?
I was thinking about it, but it seemed to be more complicated, and I thought that when we avoided the copy for 4000 vertices, avoiding another copy for 4 vertices doesn't change much for quick test. But it seems possible to avoid this copy and memory allocation too.

> 3) Could you please comment the code a bit more, e.g. how it interacts with the
> other parts of u_vbuf_mgr.
It saves vertex buffer before replacing it with the newly created buffer, and sets boolean flag which means that it needs to restore vertex buffer and free allocated memory in the vbuf_mgr_draw_end, IIRR that's all interaction. (I don't remember now if it's really needed to save it, or maybe it's needed to save/restore even more data - e.g. index buffer too. I need to check the code to give the exact answer.)

> 4) We could use the translate module to convert indexed vertices in several
> separate buffers into one non-indexed interleaved buffer. I think nv50 and nvc0
> always do that for user buffers.
I was looking into it, but it seemed to be the overkill for the simple test. I just wanted to copy vertex data from single interleaved buffer as is, without any format conversions and/or other modifications. But maybe it makes sense to use it if we want to have universal solution.

> 
> I think it's an optimization we definitely want.
Should I try to improve these patches for vbuf_mgr according to your comments? Or may be it's better to reimplement it in some other way and in some other place?
Comment 8 Marek Olšák 2011-08-27 11:14:54 UTC
(In reply to comment #7)
> OK, I understand that mesa would benefit from such optimization, but taking
> into consideration the time which I'll need to spend on analyzing different
> ways of doing it, implementing and testing, and potential benefits for
> applications, it wasn't a top priority for me.
> 
> These patches were intended for doing a quick test, handling just this single
> case. But if we want more universal solution, then I'm not sure that vbuf_mgr
> is a best place for this optimization - maybe it makes sense to do it somewhere
> else in the mesa, not on the driver level?

I have been thinking about moving u_vbuf_mgr from the drivers to st/mesa or somewhere between drivers and state trackers, along with user buffer uploads. u_vbuf_mgr can be used by state trackers too, it's not limited to "just drivers".

However we should consider that certain drivers can fetch vertices from user memory just fine. Those are softpipe, llvmpipe, software vertex processing on r300 swtcl chipsets (e.g. ATI Xpress and AMD 690 chipsets), i915, and pretty much anything that makes use of draw_context. Doing vertex uploads on those already-slow drivers would make them even slower, obviously.


> IIRR I called it explicitly from the r600g because I wasn't sure initially if
> it's needed for (or will work with) r300g. It seems this call can be moved into
> the u_vbuf_mgr_draw_begin.

Vertex uploads and fallbacks work exactly the same in both r300g and r600g.

> 
> > 2) Can you directly write the new vertices into a buffer object instead of user
> > memory, which must be uploaded to a buffer object at a later point anyway?
> I was thinking about it, but it seemed to be more complicated, and I thought
> that when we avoided the copy for 4000 vertices, avoiding another copy for 4
> vertices doesn't change much for quick test. But it seems possible to avoid
> this copy and memory allocation too.
> 
> > 3) Could you please comment the code a bit more, e.g. how it interacts with the
> > other parts of u_vbuf_mgr.
> It saves vertex buffer before replacing it with the newly created buffer, and
> sets boolean flag which means that it needs to restore vertex buffer and free
> allocated memory in the vbuf_mgr_draw_end, IIRR that's all interaction. (I
> don't remember now if it's really needed to save it, or maybe it's needed to
> save/restore even more data - e.g. index buffer too. I need to check the code
> to give the exact answer.)

The index buffer isn't changed, so it's ok.

> 
> > 4) We could use the translate module to convert indexed vertices in several
> > separate buffers into one non-indexed interleaved buffer. I think nv50 and nvc0
> > always do that for user buffers.
> I was looking into it, but it seemed to be the overkill for the simple test. I
> just wanted to copy vertex data from single interleaved buffer as is, without
> any format conversions and/or other modifications. But maybe it makes sense to
> use it if we want to have universal solution.
> 
> > 
> > I think it's an optimization we definitely want.
> Should I try to improve these patches for vbuf_mgr according to your comments?
> Or may be it's better to reimplement it in some other way and in some other
> place?

Well, it's a completely different approach to uploading vertices. The translation of vertices into one non-indexed interleaved vertex buffer (as I was suggesting) could replace the rest of u_vbuf_mgr completely, i.e. uploads and format fallbacks. So there would have to be some code determining whether the rest of u_vbuf_mgr should be used (fine-grained uploads based on the min/max_index data) or the pure translate way (complete translation of all vertices into a non-indexed draw operation).

I am for using the translate module and doing complete translation in the cases where it's worth it, and using the rest of u_vbuf_mgr otherwise.
Comment 9 Pavel Ondračka 2011-08-27 12:37:32 UTC
(In reply to comment #4)
> Same here on Intel GMA 4500 and Mesa 7.12-devel

BTW I'm not a developer, but I think only gallium drivers are using u_vbuf_mgr. So even if those patches could make it in they won't probably solve this problem for you. Maybe you should open a new bug against i915 or i965 driver? Or is a universal solution possible?
Comment 10 Md Imam Hossain 2011-08-28 20:11:52 UTC
(In reply to comment #9)
> (In reply to comment #4)
> > Same here on Intel GMA 4500 and Mesa 7.12-devel
> BTW I'm not a developer, but I think only gallium drivers are using u_vbuf_mgr.
> So even if those patches could make it in they won't probably solve this
> problem for you. Maybe you should open a new bug against i915 or i965 driver?
> Or is a universal solution possible?

But the problem maybe is in the same component for the Intel drivers side.

No graphical glitch just a few frams per second

The problem with Linux Mesa Intel drivers is they support more features for example more OpenGL extensions than Windows Intel drivers but Windows drivers are bug and performance proof.
Comment 11 Tobias Jakobi 2011-08-29 02:03:49 UTC
(In reply to comment #10)
> But the problem maybe is in the same component for the Intel drivers side.
i965 is not using gallium, so no.
Comment 12 Md Imam Hossain 2011-08-29 04:54:26 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > But the problem maybe is in the same component for the Intel drivers side.
> i965 is not using gallium, so no.

Sorry, I meant u_vbuf_manager equivalent component in the i965 driver not in gallium
Comment 13 Marek Olšák 2012-01-05 10:16:59 UTC
Fixed by ce44bae366ade59fb2dbdfbfe5a1ab8d24518a57 for radeons. You'll probably need separate bug reports for the other drivers if they have this issue. Closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.