Bug 33934

Summary:	3D blitting is orders of magnitude slower than equivalent 2D blitting.
Product:	Mesa	Reporter:	Neil Roberts <nroberts>
Component:	Drivers/DRI/i965	Assignee:	Chris Wilson <chris>
Status:	RESOLVED FIXED	QA Contact:
Severity:	enhancement
Priority:	medium	CC:	liquid.acid
Version:	git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	meta: Try using glCopyTexSubImage2D in _mesa_meta_BlitFramebuffer; Test case showing the performance difference; Move variations for blitting

Description Neil Roberts 2011-02-05 05:50:10 UTC

The Mesa meta implementation of glBlitFramebuffer (which seems to be used by most of the drivers) performs a render when the source and destination framebuffers are textures. However, I think this function is also commonly used to copy a region between two textures without scaling. In that case it would be good if the operation could boil down to a hardware blit rather than having to submit geometry.

One way to do this could be to use glCopyTexSubImage2D in the meta code. This seems to end up being a blit on at least the i965 and Radeon drivers.

This came about because in Clutter we were getting some pressure to use glBlitFramebuffer instead of glCopyTexSubImage2D to migrate images between atlas textures, because it is faster on some (non-Mesa) drivers. However, I noticed that the opposite is true for Mesa.

Comment 1 Neil Roberts 2011-02-05 05:51:49 UTC

Created attachment 42961 [details] [review]
meta: Try using glCopyTexSubImage2D in _mesa_meta_BlitFramebuffer

In the case where glBlitFramebuffer is being used to copy to a texture
without scaling it is faster if we can use the hardware to do a blit
rather than having to do a texture render. In most of the drivers
glCopyTexSubImage2D will use a blit so this patch makes it check for
when glBlitFramebuffer is doing a simple copy and then divert to
glCopyTexSubImage2D.
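
For illustration, the "simple copy" check described above could look roughly like the sketch below. This is not the code from the attached patch: `struct blit_rect`, `blit_is_simple_copy` and the `color_only` flag are hypothetical names, and the exact conditions the patch tests are not shown in this report. The sketch only assumes that a blit qualifies when both rectangles have the same size, neither is mirrored, and only the color buffer is requested.

```c
#include <stdbool.h>

/* Hypothetical description of one side of a blit; not a Mesa structure.
 * (x0, y0) and (x1, y1) mirror the corner arguments of glBlitFramebuffer. */
struct blit_rect {
    int x0, y0, x1, y1;
};

/*
 * Returns true when the blit is a plain 1:1 copy:
 *   - only the color buffer is requested (color_only is an assumed stand-in
 *     for checking the mask argument),
 *   - neither rectangle is mirrored (x0 <= x1 and y0 <= y1),
 *   - source and destination have identical width and height (no scaling).
 */
static bool
blit_is_simple_copy(const struct blit_rect *src,
                    const struct blit_rect *dst,
                    bool color_only)
{
    if (!color_only)
        return false;

    /* A reversed rectangle means the blit flips the image. */
    if (src->x1 < src->x0 || src->y1 < src->y0)
        return false;
    if (dst->x1 < dst->x0 || dst->y1 < dst->y0)
        return false;

    /* Differing sizes mean the blit scales the image. */
    return (src->x1 - src->x0) == (dst->x1 - dst->x0) &&
           (src->y1 - src->y0) == (dst->y1 - dst->y0);
}
```

When the check passes, the copy maps onto the glCopyTexSubImage2D arguments as xoffset = dst.x0, yoffset = dst.y0, x = src.x0, y = src.y0, with the shared width and height.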

Comment 2 Marek Olšák 2011-02-06 06:40:45 UTC

Both r300g and r600g use a 3D blit (by drawing a textured quad), with or without scaling. AFAIK, r600g has no other way to do a blit. Your commit would not change anything on this hardware.

Since the BlitFramebuffer function in st/mesa is simpler than CopyTexSubImage, I don't consider this an enhancement.

Comment 3 Neil Roberts 2011-02-08 05:08:06 UTC

Created attachment 43096 [details]
Test case showing the performance difference

Well, at least on the Intel driver there is a faster path for blitting that glCopyTexSubImage2D uses. If it's not also beneficial for Radeon, then maybe we should move the patch to be specific to the Intel drivers.

Attached is a test case to get some timing for the two functions.

Without patch:

time for glBlitFramebuffer = 122285
time for glCopyTexSubImage2D = 6097

So glCopyTexSubImage2D is 1906% faster than glBlitFramebuffer.

With the patch I get:

time for glBlitFramebuffer = 25740
time for glCopyTexSubImage2D = 6900

The patch improves the speed of glBlitFramebuffer by 375% but it's still pretty slow compared to glCopyTexSubImage2D. Maybe the cost of glBlitFramebuffer is mostly in preserving the GL state across the Mesa meta calls and the patch still does a bit of this. Maybe we should make a proper Intel-specific fast path for glBlitFramebuffer that directly calls intelEmitCopyBlit like do_copy_texsubimage does so that it can avoid affecting the GL state.
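
For reference, the percentages quoted above follow from (slow / fast - 1) * 100 applied to the reported times. A minimal sketch of that arithmetic, where `percent_faster` is a hypothetical helper and not part of the attached test case:

```c
/*
 * Percentage by which the faster path beats the slower one:
 * (slow_time / fast_time - 1) * 100.
 * Both times must be in the same unit and fast_time must be non-zero.
 */
static double
percent_faster(double slow_time, double fast_time)
{
    return (slow_time / fast_time - 1.0) * 100.0;
}
```

With the numbers from this comment, percent_faster(122285, 6097) gives about 1906 and percent_faster(122285, 25740) gives about 375. By the same measure, the patched glBlitFramebuffer (25740) still takes roughly 3.7 times as long as glCopyTexSubImage2D (6900).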

Comment 4 Marek Olšák 2011-02-08 05:58:38 UTC

There is almost no performance difference on Radeons. I guess the patch should be made Intel-only.

Comment 5 Chris Wilson 2011-02-08 06:21:23 UTC

Another perspective. Running without the patch:

=0 sandybridge:~$ DISPLAY=:0.0 ./copy-tex-subimage # gt1
time for glBlitFramebuffer = 119331
time for glCopyTexSubImage2D = 1518
=0 sandybridge:~$ DISPLAY=:0.1 ./copy-tex-subimage # radeon 5770, r600g
time for glBlitFramebuffer = 15952
time for glCopyTexSubImage2D = 16237

And after applying the patch:

=0 sandybridge:~$ DISPLAY=:0.0 jhbuild run ./copy-tex-subimage
time for glBlitFramebuffer = 4706
time for glCopyTexSubImage2D = 1519
=0 sandybridge:~$ DISPLAY=:0.1 jhbuild run ./copy-tex-subimage
time for glBlitFramebuffer = 16318
time for glCopyTexSubImage2D = 16649

This makes less of a case for implementing a fast path for glBlitFramebuffer, and more of a case that i965 needs to seriously fix the bottleneck uncovered by the current code.

Comment 6 Chris Wilson 2011-02-08 06:39:32 UTC

Marek, this appears to be an intel issue, agreed?

Comment 7 Marek Olšák 2011-02-08 07:05:22 UTC

(In reply to comment #6)
> Marek, this appears to be an intel issue, agreed?

Yes, I agree.

Comment 8 Chris Wilson 2011-02-09 04:50:30 UTC

This is where I'm up to (reducing meta-op overhead vs patch):
                                     q35             snb
glBlitFramebuffer 1x1           29336  8833     22606  3990
glBlitFramebuffer 2x2            7360  2235      5684  1006
glBlitFramebuffer 4x4            1869  1440       587   269
glBlitFramebuffer 8x8             493   174       377    83
glBlitFramebuffer 16x16           151    70       112    38
glBlitFramebuffer 32x32            65    45        44    36
glBlitFramebuffer 64x64            43    38        28    24
glBlitFramebuffer 128x128          38    36        23    22
glBlitFramebuffer 256x256          36    36        23    23
glBlitFramebuffer 512x512          36    36        22    22
glBlitFramebuffer 1024x1024        36    35        22    23

glCopyTexSubImage2D 1x1          3861            1350
glCopyTexSubImage2D 2x2           990             355
glCopyTexSubImage2D 4x4           274             106
glCopyTexSubImage2D 8x8            96              43   
glCopyTexSubImage2D 16x16          51              27
glCopyTexSubImage2D 32x32          39              23   
glCopyTexSubImage2D 64x64          37              22
glCopyTexSubImage2D 128x128        36              22
glCopyTexSubImage2D 256x256        36              22   
glCopyTexSubImage2D 512x512        36              22
glCopyTexSubImage2D 1024x1024      36              22

Comment 9 Chris Wilson 2011-02-09 09:16:45 UTC

To put the speed into perspective, I tried a couple of other variations, essentially unrolling the meta-op blit in the test itself. On the q35:

                Blit            Quads           Tristrip        Copy
1x1             29393           638             212             3786
2x2              7339           167             110              955
4x4              1852            48              20              245
8x8               469            18              15               68
16x16             123            12               9               23
32x32              38             9               9               14
64x64              16             9               9                9
128x128            10             8               9                8
256x256             9             9               9                9
512x512             9             9               9                9
1024x1024           9             9               8                9

Comment 10 Chris Wilson 2011-02-09 09:30:13 UTC

Created attachment 43167 [details] [review]
Move variations for blitting

Comment 11 Chris Wilson 2011-02-24 08:49:27 UTC

Having spent a couple of weeks tackling the underlying problem of why snb was so slow, I've finally ported the meta-op to intel and pushed.

On applying:
       1x1     2x2     4x4     8x8     16x16   32x32   128x128 256x256 512x512
Blit:   2113    532     134     34      8       3       1       1       0
Quads:  141     265     66      3       1       0       0       0       0
Tri:    90      133     35      1       1       0       1       0       0
Copy:   1749    437     109     28      7       2       0       1       0

so glBlitFramebuffer is now only marginally slower than glCopyTexSubImage2D and hopefully adequate.


commit c0ad70ae31ee5501281b434d56e389fc92b13a3a
Author: Neil Roberts <neil@linux.intel.com>
Date:   Sat Feb 5 10:21:11 2011 +0000

    intel: Try using glCopyTexSubImage2D in _mesa_meta_BlitFramebuffer
    
    In the case where glBlitFramebuffer is being used to copy to a texture
    without scaling it is faster if we can use the hardware to do a blit
    rather than having to do a texture render. In most of the drivers
    glCopyTexSubImage2D will use a blit so this patch makes it check for
    when glBlitFramebuffer is doing a simple copy and then divert to
    glCopyTexSubImage2D.
    
    This was originally proposed as an extension to the common meta-ops.
    However, it was rejected as using the BLT is only advantageous for Intel
    hardware.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=33934
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
