Bug 96355 - Performance: extra&costly SSBO validation even when SSBO aren't used
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/nouveau
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Nouveau Project
Reported: 2016-06-03 07:53 UTC by gregory.hainaut
Modified: 2016-06-06 09:17 UTC (History)

Description gregory.hainaut 2016-06-03 07:53:01 UTC
Hello,

I'm currently trying to profile my application (PCSX2) with Mesa. I don't know if my GPU (GTX 760) is properly reclocked, but my app is often CPU-limited. It could just be that the I/O operations are very slow.

Anyway, perf shows that nvc0_validate_buffers is called (too) often:
+    8.78%     7.98%  pcsx2_GSReplayL  nouveau_dri.so  nvc0_validate_buffers

My understanding of the code is that every time we switch shader programs, a full SSBO bind/validation is performed: nvc0_set_shader_buffers dirties the buffer state (NVC0_NEW_3D_BUFFERS). The catch is that my application doesn't use SSBOs (only UBOs). Is the SSBO validation code expected to run when the shader program doesn't use them? If not, a validation shortcut would be nice.

In case it helps, here is the backtrace from nvc0_set_shader_buffers:

#0  nvc0_set_shader_buffers (pipe=0x87c51e0, shader=0, start=16, nr=16, buffers=0x0) at nvc0/nvc0_state.c:1331
#1  0xf464acc4 in st_bind_ssbos (shader=0x8b106bc, shader_type=0, st=0x877ca38, st=0x877ca38) at state_tracker/st_atom_storagebuf.c:86
#2  0xf464ad0d in bind_vs_ssbos (st=0x877ca38) at state_tracker/st_atom_storagebuf.c:101
#3  0xf4647411 in st_validate_state (st=0x877ca38, pipeline=ST_PIPELINE_RENDER) at state_tracker/st_atom.c:289
#4  0xf46638ef in st_draw_vbo (ctx=0x8801f60, prims=0xffffa990, nr_prims=1, ib=0xffffa980, index_bounds_valid=0 '\000', min_index=4294967295, max_index=4294967295, tfb_vertcount=0x0, 
    stream=0, indirect=0x0) at state_tracker/st_draw.c:176
#5  0xf46270f9 in vbo_validated_drawrangeelements (ctx=ctx@entry=0x8801f60, mode=mode@entry=4, index_bounds_valid=0 '\000', start=4294967295, end=4294967295, count=6, type=5125, 
    indices=0x25258, basevertex=19047, numInstances=1, baseInstance=0) at vbo/vbo_exec_array.c:849
#6  0xf46274bc in vbo_exec_DrawElementsBaseVertex (mode=4, count=6, type=5125, indices=0x25258, basevertex=19047) at vbo/vbo_exec_array.c:1007
#7  0xf6ddf422 in shared_dispatch_stub_702 (mode=4, count=6, type=5125, indices=0x25258, basevertex=19047) at shared-glapi/glapi_mapi_tmp.h:21235
#8  0xf6362e0a in Draw (this=<optimized out>, this=<optimized out>, basevertex=<optimized out>, mode=<optimized out>)

Feel free to ask for trace/debug info.

Best regards
Comment 1 gregory.hainaut 2016-06-03 08:01:45 UTC
As a side note, I potentially see similar behavior with shader images (st_bind_*_images). I need to double-check my engine, as I do use them sometimes.
Comment 2 Samuel Pitoiset 2016-06-03 08:17:51 UTC
Hi Gregory,

Thanks for profiling Nouveau with perf, that's very nice. :-)

Well, if your application doesn't use SSBOs, nvc0_validate_buffers() should not be called, yeah. But this might happen when we switch between different contexts. Anyway, improving the validation path is on our todolist. :)

Well, according to your backtrace, nvc0_set_shader_buffers() is called and will dirty NVC0_NEW_3D_BUFFERS, which will then call nvc0_validate_buffers() at draw time.
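
The mechanism described here, where a state setter raises a dirty bit and the next draw runs the matching validation function, can be sketched roughly as follows. This is a simplified illustration and not nouveau code: `toy_context`, `TOY_NEW_BUFFERS` and the `toy_*` functions are hypothetical stand-ins for the nvc0 context, NVC0_NEW_3D_BUFFERS, nvc0_set_shader_buffers() and nvc0_validate_buffers().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dirty bit, standing in for NVC0_NEW_3D_BUFFERS. */
#define TOY_NEW_BUFFERS (1u << 0)

struct toy_context {
    uint32_t dirty;           /* state groups waiting for validation */
    unsigned validate_calls;  /* how many times the costly path ran */
};

/* Setter: marks the buffer state dirty unconditionally, even when the
 * caller only unbinds slots that were already empty. */
static void toy_set_shader_buffers(struct toy_context *ctx)
{
    assert(ctx != NULL);
    ctx->dirty |= TOY_NEW_BUFFERS;
}

/* Stand-in for the expensive validation work. */
static void toy_validate_buffers(struct toy_context *ctx)
{
    ctx->validate_calls++;
}

/* Draw: validates only the state groups flagged as dirty, then clears them. */
static void toy_draw(struct toy_context *ctx)
{
    assert(ctx != NULL);
    if (ctx->dirty & TOY_NEW_BUFFERS)
        toy_validate_buffers(ctx);
    ctx->dirty = 0;
}
```

In this model, a draw that follows a setter call always pays for validation, while back-to-back draws without a setter call in between do not.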

I wonder why it's called if you are sure that your application doesn't use any SSBOs...

Can you extract some shaders from your application to make sure no SSBOs are used? You can use NV50_PROG_DEBUG=1 for example (this will dump the TGSI code).
Comment 3 gregory.hainaut 2016-06-03 15:15:23 UTC
Hi Samuel,

> Thanks for profiling Nouveau with perf, that's very nice. :-)

Well it is nice that I can do profiling :)

> Well, if your application doesn't use SSBOs, nvc0_validate_buffers()
> should not be called, yeah. But this might happen when we switch between
> different contexts. Anyway, improving the validation path is on our todolist. :)

Yes, I'm sure. I don't know how to use SSBO.

> I wonder why it's called if you are sure that your application doesn't use
> any SSBOs...

In src/mesa/state_tracker/st_atom_storagebuf.c, the st_bind_*_ssbos structs contain the ST_NEW_*_PROGRAM flags.

So every time you call glUseProgram (or the 4.1 pipeline equivalent), those flags are raised and a validation is triggered. It is the same for images, with the st_bind_*_images structs in st_atom_image.c. That is not great for performance.
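
To illustrate the point, here is a minimal sketch of a table of state atoms, each with a mask of the dirty flags that trigger it. It assumes a much simpler layout than the real state tracker: `toy_atom`, `toy_state`, the `TOY_NEW_*` flags and the `toy_*` functions are hypothetical names, only loosely modelled on the ST_NEW_* flags and st_validate_state().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dirty flags, loosely modelled on the ST_NEW_* bits. */
#define TOY_NEW_VS_PROGRAM (1u << 0)
#define TOY_NEW_VS_SSBOS   (1u << 1)

struct toy_state {
    uint32_t dirty;       /* flags raised since the last validation */
    unsigned ssbo_binds;  /* how often the SSBO atom ran */
};

/* An atom runs whenever one of the flags in its mask is dirty. */
struct toy_atom {
    const char *name;
    uint32_t dirty_mask;
    void (*update)(struct toy_state *st);
};

static void toy_bind_vs_ssbos(struct toy_state *st)
{
    st->ssbo_binds++;
}

/* Because the mask includes the program flag, every program switch
 * re-runs the SSBO atom, whether or not the program uses SSBOs. */
static const struct toy_atom toy_atoms[] = {
    { "bind_vs_ssbos", TOY_NEW_VS_PROGRAM | TOY_NEW_VS_SSBOS, toy_bind_vs_ssbos },
};

/* Stand-in for what a program switch does to the dirty flags. */
static void toy_use_program(struct toy_state *st)
{
    assert(st != NULL);
    st->dirty |= TOY_NEW_VS_PROGRAM;
}

/* Run every atom whose mask intersects the dirty flags, then clear them. */
static void toy_validate_state(struct toy_state *st)
{
    assert(st != NULL);
    for (size_t i = 0; i < sizeof(toy_atoms) / sizeof(toy_atoms[0]); i++) {
        if (st->dirty & toy_atoms[i].dirty_mask)
            toy_atoms[i].update(st);
    }
    st->dirty = 0;
}
```

With this layout, an application that switches programs before every draw runs the SSBO atom once per draw, which matches the pattern in the backtrace.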

> Can you extract some shaders from your application to make sure no SSBOs
> are used? You can use NV50_PROG_DEBUG=1 for example (this will dump the TGSI code).

All my shaders can be found in GLSL format (a bit of a mess of ifdefs, but no SSBO ;))
https://github.com/PCSX2/pcsx2/tree/master/plugins/GSdx/res/glsl

Here is an example (I'm not sure if it is the TGSI format).

FRAG
DCL IN[0], GENERIC[0], PERSPECTIVE
DCL IN[1], GENERIC[3], PERSPECTIVE
DCL OUT[0], COLOR
DCL OUT[1], COLOR[1]
DCL SAMP[0]
DCL SAMP[1]
DCL SVIEW[0], 2D, FLOAT
DCL SVIEW[1], 2D, FLOAT
DCL CONST[1][0]
DCL CONST[2][0..1]
DCL CONST[3][0..1]
DCL CONST[4][0]
DCL CONST[5][0..1]
DCL CONST[6][0..7]
DCL CONST[7][0]
DCL TEMP[0..1], LOCAL
IMM[0] FLT32 {    0.0000,   255.0000,     0.0500,     0.0078}
IMM[1] FLT32 {    0.0039,     0.0000,     0.0000,     0.0000}
  0: MOV TEMP[0].xy, IN[1].xyyy
  1: TEX TEMP[0].w, TEMP[0], SAMP[0], 2D
  2: MOV TEMP[1].y, IMM[0].xxxx
  3: MOV TEMP[1].x, TEMP[0].wwww
  4: TRUNC TEMP[0], IN[0]
  5: MOV TEMP[1].xy, TEMP[1].xyyy
  6: TEX TEMP[1], TEMP[1], SAMP[1], 2D
  7: MAD TEMP[1], TEMP[1], IMM[0].yyyy, IMM[0].zzzz
  8: TRUNC TEMP[1], TEMP[1]
  9: MUL TEMP[0], TEMP[0], TEMP[1]
 10: MUL TEMP[0], TEMP[0], IMM[0].wwww
 11: TRUNC TEMP[0], TEMP[0]
 12: MIN TEMP[0], TEMP[0], IMM[0].yyyy
 13: MUL TEMP[1], TEMP[0], IMM[1].xxxx
 14: MUL TEMP[0].x, TEMP[0].wwww, IMM[0].wwww
 15: MOV OUT[0], TEMP[1]
 16: MOV OUT[1], TEMP[0].xxxx
 17: END
Comment 4 Ilia Mirkin 2016-06-03 15:32:22 UTC
Right ... other things deal with this by using the cso_cache (or the backend driver handles it). We should probably do the same for this as well. Add a per-buffer dirty bit and only set it if the binding has actually changed. Or add it to the cso_context logic.
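
A per-buffer dirty bit of the kind suggested above could look roughly like this. It is a sketch under simplifying assumptions, not the patch that was eventually pushed: a single shader stage, and hypothetical names (`toy_context`, `toy_buffer_binding`, `toy_set_shader_buffers`). Passing NULL for `buffers` means "unbind", as in the backtrace where `buffers=0x0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BUFFERS 32

/* Hypothetical description of one bound buffer range. */
struct toy_buffer_binding {
    const void *resource;
    unsigned offset;
    unsigned size;
};

struct toy_context {
    struct toy_buffer_binding buffers[TOY_MAX_BUFFERS];
    uint32_t buffers_dirty;  /* one bit per slot that really changed */
};

static int toy_binding_equal(const struct toy_buffer_binding *a,
                             const struct toy_buffer_binding *b)
{
    return a->resource == b->resource &&
           a->offset == b->offset &&
           a->size == b->size;
}

/* Bind `nr` slots starting at `start`; a NULL `buffers` unbinds them.
 * Only slots whose contents differ are marked dirty.
 * Returns non-zero if anything changed. */
static int toy_set_shader_buffers(struct toy_context *ctx,
                                  unsigned start, unsigned nr,
                                  const struct toy_buffer_binding *buffers)
{
    static const struct toy_buffer_binding empty = { NULL, 0, 0 };
    uint32_t changed = 0;

    assert(start + nr <= TOY_MAX_BUFFERS);
    for (unsigned i = 0; i < nr; i++) {
        const struct toy_buffer_binding *next = buffers ? &buffers[i] : &empty;
        struct toy_buffer_binding *cur = &ctx->buffers[start + i];

        if (toy_binding_equal(cur, next))
            continue;  /* nothing to do for this slot */
        *cur = *next;
        changed |= 1u << (start + i);
    }
    ctx->buffers_dirty |= changed;
    return changed != 0;
}
```

With this check, unbinding 16 slots that are already empty (the call seen in the backtrace) leaves `buffers_dirty` at zero, so draw-time validation has nothing to do.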
Comment 5 Samuel Pitoiset 2016-06-03 16:16:06 UTC
Thanks for the report.

We will fix it.
Comment 6 gregory.hainaut 2016-06-04 11:32:44 UTC
Thank you.

I did a quick benchmark of my testcase:

raw GIT 
   => Mean by frame: 32.083336ms (31.168831fps)

GIT + hack to remove the new program flags from SSBO and images
   => Mean by frame: 21.586538ms (46.325169fps)

Note: the testcase uses lots of shader binds, so I guess it is kind of a worst case for performance.
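
For reference, the fps figures above follow directly from the mean frame times, and the ratio of the two frame times gives a speedup of about 1.49x. A small sketch of that arithmetic (the helper names are made up for illustration):

```c
#include <assert.h>

/* Convert a mean frame time in milliseconds to frames per second. */
static double toy_fps_from_frame_ms(double frame_ms)
{
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}

/* Speedup factor between two mean frame times (before / after). */
static double toy_speedup(double before_ms, double after_ms)
{
    assert(before_ms > 0.0 && after_ms > 0.0);
    return before_ms / after_ms;
}
```

Calling `toy_fps_from_frame_ms(32.083336)` and `toy_fps_from_frame_ms(21.586538)` reproduces the two fps values quoted above, and `toy_speedup(32.083336, 21.586538)` gives the ratio between them.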
Comment 7 Ilia Mirkin 2016-06-05 03:53:15 UTC
I've pushed out some changes to nvc0 to reduce overhead of updating ssbo/images.

There are additional patches I've sent out to validate ssbo/images more often in the st (right now we miss some cases).

Let me know if the profile looks any better now.
Comment 8 Gediminas Jakutis 2016-06-05 10:18:39 UTC
I don't know about the reporter's case, but I have run some benchmarks and tests with f018456901ee291181ecce74c30b19c9f6731f06 (latest revision before those four patches) and fd6bbc2ee205ed02f66a8d8ef5b2adf4005d588c (the latest revision, with the four patches) on my GTX 770 + FX-8320 @ 4.1GHz, focusing on CPU-bound cases.

The results are all for the better: on most games I tested I see a 4-10% performance boost. I am only going to list a pair of highlights:

· Age of Wonders III, my own severely CPU-limited testcase: 21 fps -> 26 fps, a jump by a whopping 23.8% (still CPU-bound, though).
· Payday 2: this game has no [reproducible] way to benchmark it, but the gameplay used to be a nightmare filled with severe rubber-banding, running at just 18-22 fps in many situations, all while painfully CPU-bound. Now, most of the rubber-banding is either gone or a lot less noticeable. The framerate in those situations went up to 25-60 fps, dipping below 30 very rarely, while mostly maintaining an over 2x performance boost. Basically, these four patches made the game *playable* on nouveau. (The game is still very painfully CPU-bound, though.)

So, at least here, I can see clear performance benefits.
I will leave it to the reporter to mark this as RESOLVED; I don't want to hijack his issue.
Comment 9 gregory.hainaut 2016-06-05 11:09:09 UTC
Hello,

It is much better. I disabled my CPU turbo to reduce perf variation, hence the smaller values.

I'm now around 33-34 fps with latest git. For reference, if I disable validation completely with a hack, I'm around 35-36 fps.

It isn't completely free, but it feels good enough. Maybe one could create a benchmark test that ping-pongs between 2 different programs (could be the same one compiled twice). The issue can be closed.
Comment 10 gregory.hainaut 2016-06-05 14:42:20 UTC
Hi Ilia,

You told me on IRC that you validate all SSBOs when one is updated. I suspect a similar pattern for UBOs, i.e. all UBOs are validated when one is updated.

Potentially, validation is even done for all shader stages. Anyway, I moved my UBO declarations around a bit to reduce the number of active UBOs for a draw call, and I managed to gain a couple of fps (67 fps => 70 fps).

So it might be worth investigating the single SSBO/UBO bucket validation further.
Comment 11 Ilia Mirkin 2016-06-05 15:41:58 UTC
(In reply to gregory.hainaut from comment #10)
> Hi Ilia,
> 
> You told me on IRC that you validate all SSBOs when one is updated. I
> suspect a similar pattern for UBOs, i.e. all UBOs are validated when one is
> updated.

Nope. UBOs (and textures) have their individual validation "buckets".

> 
> Potentially, validation is even done for all shader stages. Anyway, I moved
> my UBO declarations around a bit to reduce the number of active UBOs for a
> draw call, and I managed to gain a couple of fps (67 fps => 70 fps).
> 
> So it might be worth investigating the single SSBO/UBO bucket validation
> further.

There are different stages of validation. It's all extremely confusing. st/mesa validates everything, because it has to - which UBO is bound to where is based on program uniform settings:

      binding = &st->ctx->UniformBufferBindings[shader->UniformBlocks[i]->Binding];

So if either of those is updated, we have to revalidate. However, there's a CSO cache backing UBOs, which will avoid propagating the set to the backend if nothing has changed.

I don't think we can do much better than this without some much larger rejiggers.

Perhaps there are still some things we can do to speed up common scenarios like "there are no ubos" or "there are no ssbos" or "there are no images". But it doesn't seem immediately apparent to me.
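
The two-step lookup and the "don't propagate if nothing changed" filter described above can be sketched like this. It shows one possible shape of such a filter and is not how cso_context is actually implemented; `toy_context`, `toy_program`, `toy_binding` and `toy_bind_ubos` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_BLOCKS   4
#define TOY_MAX_BINDINGS 8

/* Hypothetical API-level binding point contents. */
struct toy_binding {
    unsigned buffer_id;
    unsigned offset;
    unsigned size;
};

/* Each uniform block of a program names the binding point it reads from. */
struct toy_program {
    unsigned num_blocks;
    unsigned block_binding[TOY_MAX_BLOCKS];
};

struct toy_context {
    struct toy_binding bindings[TOY_MAX_BINDINGS]; /* context-wide binding points */
    struct toy_binding cached[TOY_MAX_BLOCKS];     /* last state sent to the driver */
    unsigned driver_updates;                       /* how often the driver was told */
};

/* Resolve block -> binding point -> buffer for every block of the program.
 * The result depends on both the program and the context bindings, so it has
 * to be recomputed when either changes. The driver is only notified when the
 * resolved state differs from what it already has. */
static void toy_bind_ubos(struct toy_context *ctx, const struct toy_program *prog)
{
    struct toy_binding resolved[TOY_MAX_BLOCKS];

    assert(prog->num_blocks <= TOY_MAX_BLOCKS);
    memset(resolved, 0, sizeof(resolved));
    for (unsigned i = 0; i < prog->num_blocks; i++) {
        assert(prog->block_binding[i] < TOY_MAX_BINDINGS);
        resolved[i] = ctx->bindings[prog->block_binding[i]];
    }

    if (memcmp(resolved, ctx->cached, sizeof(resolved)) == 0)
        return;  /* same buffers as before: skip the driver */

    memcpy(ctx->cached, resolved, sizeof(resolved));
    ctx->driver_updates++;
}
```

Switching between two programs that resolve to the same buffers still costs the lookup loop on every switch, but the driver sees only one update.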
Comment 12 gregory.hainaut 2016-06-06 08:11:33 UTC
Actually what I saw is that all UBOs are validated when programs are switched. But I guess it is normal. I need to dig further. Thanks for the fixes.
Comment 13 Karol Herbst 2016-06-06 09:17:44 UTC
(In reply to Gediminas Jakutis from comment #8)
> I don't know about the reporter's case, but I have run some benchmarks and
> tests with f018456901ee291181ecce74c30b19c9f6731f06 (latest revision before
> those four patches) and fd6bbc2ee205ed02f66a8d8ef5b2adf4005d588c (the latest
> revision, with the four patches) on my GTX 770 + FX-8320 @ 4.1GHz, focusing
> on CPU-bound cases.
> 
> The results are all for the better: on most games I tested I see a 4-10%
> performance boost. I am only going to list a pair of highlights:
> 
> · Age of Wonders III, my own severely CPU-limited testcase: 21 fps -> 26
> fps, a jump by a whopping 23.8% (still CPU-bound, though).
> · Payday 2: this game has no [reproducible] way to benchmark it, but the
> gameplay used to be a nightmare filled with severe rubber-banding, running
> at just 18-22 fps in many situations, all while painfully CPU-bound. Now,
> most of the rubber-banding is either gone or a lot less noticeable. The
> framerate in those situations went up to 25-60 fps, dipping below 30 very
> rarely, while mostly maintaining an over 2x performance boost.
> Basically, these four patches made the game *playable* on nouveau. (The game
> is still very painfully CPU-bound, though.)
> 
> So, at least here, I can see clear performance benefits.
> I will leave it to the reporter to mark this as RESOLVED; I don't want to
> hijack his issue.

I saw the same thing with PAYDAY 2, but I couldn't reproduce the earlier low performance, so I guess they just reworked their engine when they added the SMAA and SSAO thing, and I doubt those patches had anything to do with it :/

