Bug 22127 - [UXA] 50% performance regression for XRenderFillRectangles
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: Other All
Importance: low normal
Assignee: Chris Wilson
QA Contact: Xorg Project Team
Reported: 2009-06-06 15:57 UTC by Clemens Eisserer
Modified: 2010-06-26 03:57 UTC

Attachments
xorg log (27.92 KB, text/plain)
2009-06-06 15:57 UTC, Clemens Eisserer

Description Clemens Eisserer 2009-06-06 15:57:06 UTC
Created attachment 26501
xorg log

The XRenderFillRectangles test of JXRenderMark shows a 50% performance regression when running on UXA+KMS compared to plain EXA.
The test simulates scanline based rendering using rectangles.

UXA:
93812.748533 Ops/s; rects (!); 15x15
23241.206030 Ops/s; rects (!); 75x75
7683.601740 Ops/s; rects (!); 250x250

EXA:
155345.525653 Ops/s; rects (!); 15x15
45269.111764 Ops/s; rects (!); 75x75
15016.181230 Ops/s; rects (!); 250x250

The test can be obtained from:
http://78.31.67.79:8080/jxrender/RenderMark.html
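
As a quick check of the "50%" figure, the ratios can be computed directly from the numbers listed above. This is only an arithmetic sketch; the dictionary and function names are made up for illustration and are not part of JXRenderMark.

```python
# Figures copied from the description above (Ops/s, higher is better).
uxa = {"15x15": 93812.748533, "75x75": 23241.206030, "250x250": 7683.601740}
exa = {"15x15": 155345.525653, "75x75": 45269.111764, "250x250": 15016.181230}


def relative_throughput(new, old):
    """Return new/old per rectangle size; 1.0 would mean no change."""
    return {size: new[size] / old[size] for size in old}


ratios = relative_throughput(uxa, exa)
for size, ratio in ratios.items():
    print(f"{size}: UXA reaches {ratio:.0%} of the EXA throughput")
```

On these figures UXA reaches roughly 60% of the EXA throughput at 15x15 and roughly 51% at the two larger sizes, so "50%" describes the larger rectangles best.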
Comment 1 Clemens Eisserer 2009-06-14 09:58:19 UTC
To add something to this bug (which won't be looked at anyway):

This is exactly what Java does for aliased shapes, so it's not only a synthetic benchmark. The JXRenderTest was designed to test paths used by the Java2D XRender backend.
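
To illustrate what "scanline based rendering using rectangles" means, here is a minimal sketch that breaks a filled circle into one 1-pixel-high rectangle per scanline. It is not the JXRenderMark or Java2D code; the function name and the choice of a circle are assumptions made for the example. A real client would hand the resulting list to XRenderFillRectangles in one request.

```python
import math


def circle_to_scanline_rects(cx, cy, radius):
    """Approximate a filled circle with one 1-pixel-high rectangle per scanline.

    Returns (x, y, width, height) tuples. Hypothetical helper, for
    illustration only.
    """
    rects = []
    for y in range(cy - radius, cy + radius + 1):
        dy = y - cy
        # Half-width of the span covered by the circle on this scanline.
        half = math.isqrt(radius * radius - dy * dy)
        rects.append((cx - half, y, 2 * half + 1, 1))
    return rects


rects = circle_to_scanline_rects(cx=50, cy=50, radius=7)
print(len(rects), "rectangles for one 15-pixel-wide shape")
```

Even a small shape turns into a rectangle per row, which is why the per-rectangle cost in the driver matters so much for this workload.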
Comment 2 Carl Worth 2009-06-15 12:55:05 UTC
(In reply to comment #1)
> To add something to this bug (which won't be looked at anyway):

Hi Clemens,

I'm not sure what the point is in suggesting bugs won't get looked at. I am sorry if you feel like you're being ignored for some reason, but we really do care about bugs and do want to fix them. Please chalk things up to us having more to do than we can as opposed to any kind of intentional negligence.

> This is exactly what Java does for aliased shapes, so it's not only a synthetic
> benchmark. The JXRenderTest was designed to test paths used by the Java2D
> XRender backend.

I wonder if the 50% drop you're getting here could be related to the 50% difference in performance I'm getting with cairo-image trapezoid rendering compared to cairo-xlib+UXA trapezoid rendering, (even after I changed UXA to directly use pixman to render to a system-memory buffer). Or it could just be coincidental.

Anyway, you've mentioned that you can see the performance change when switching from EXA to UXA. Which version of the driver (and kernel) were you using when you did this test?

Thanks,

-Carl

Comment 3 Clemens Eisserer 2009-06-16 10:20:08 UTC
Hi Carl,

This is on kernel-2.6.30-6.fc12.i586 and intel-2.7.0, both deployed by Fedora Rawhide.
Yes, I saw the difference between EXA and UXA+KMS on the same system.
I got curious because Fedora 8 with intel-2.2.1+EXA+Xorg-1.3 performed much better in this test compared to 2.7.0+UXA+KMS on Fedora 11.

Sorry for my unfriendly response, it's just that bugs 16917 & 17933 were filed almost a year ago and still make me suffer.
In early 2010 the XRender Java2D backend will most likely be deployed on a large scale, and people using Java applications will experience artifacts, not just in corner cases.

And now with UXA, this bug & 18075 make performance suffer too. Eric had some suggestions to improve performance for 18075, but that didn't work out too well. 
Unfortunately this is the way Java2D does all its antialiasing, and it works well on OpenGL, Direct3D and is at least acceptable with EXA/XAA based drivers.
Eric mentioned that it would help to XPutImage into malloc'ed memory for busy buffers. I had a look at the drm code but failed to do it myself.

Comment 4 Clemens Eisserer 2009-06-26 01:34:28 UTC
Hi Carl,

With 2.8 on 2.6.31-0.28.rc1.fc12.i586 I get even worse results, about half the throughput I got with 2.7.0.

50904.787836 Ops/s; rects (!); 15x15
12014.496179 Ops/s; rects (!); 75x75
3657.911342 Ops/s; rects (!); 250x250
34222.675803 Ops/s; rects composition (!); 15x15
10539.501111 Ops/s; rects composition (!); 75x75
3444.540728 Ops/s; rects composition (!); 250x250
7597.895967 Ops/s; put composition (!); 15x15
5656.032729 Ops/s; put composition (!); 75x75
1443.134083 Ops/s; put composition (!); 250x250

This could be because Fedora enabled some debug stuff in recent 2.6.31 kernels, as I can see most time is spent in kernel-code.
Most likely this is caused by syslog flooded with messages, and logging is to blame.
Comment 5 Clemens Eisserer 2009-06-29 14:38:31 UTC
I can now confirm that with 2.8 on Rawhide the results are even worse than with 2.7.0 (the patched one deployed with Fedora 11), even with kmemleak disabled:

62133.192307 Ops/s; rects (!); 15x15

This is way less than what I got with EXA/2.7:
155345.525653 Ops/s; rects (!); 15x15

Please have a look at it if you've some time left.
Comment 6 Hans-Christian Jansen 2009-07-21 04:24:17 UTC
Following the discussion on Intel-gfx I benchmarked my laptop (P4 2.6 GHz, GeForce2 Go (low cost mobile GPU) + proprietary legacy drivers, 5 years old):

202196.948344 Ops/s; rects (!); 15x15
52850.677392 Ops/s; rects (!); 75x75
11221.858765 Ops/s; rects (!); 250x250

This 5yo machine yields better results than the Core2Duo machine with Intel-gpu the submitter benchmarked on.
Comment 7 Clemens Eisserer 2009-07-29 09:25:48 UTC
Could you please run the benchmark in-house on 2.7+EXA+UMS vs. 2.8+UXA+KMS?
It should take no more than 5 minutes.
Comment 8 Chris Wilson 2009-12-01 15:38:19 UTC
The regression here is entirely due to the extra cost of relocations with UXA vs EXA (because the memory manager now lies in the kernel, we cannot simply emit absolute offsets). Because FillRectangles is accelerated using the 2D blitter, which requires a relocation per command, this benchmark now requires thousands of relocations per batch buffer. Quite the increase in overhead - and this overhead has only increased over time as we add more paranoid command checking to the kernel.

One method we could use to avoid the extra relocation overhead is to switch to using the 3D unit for these operations - though that imposes its own set of restrictions. In short, I think continuing to optimise this code path is a dead end - the fillRects benchmark itself does not look representative of the task that I believe it is trying to replicate (scan-line rasterisation of polygons) simply because using the constant color is already hitting special code paths through the client stack. And I have alternate ideas on how scan-line rasterisation should be accelerated, relegating the use of XRenderFillRectangles still further.
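
The scaling argument can be sketched as a toy model. Only the "one relocation per blitter command" part comes from the explanation above; the assumption that a 3D-style path needs a fixed number of relocations per batch, and the placeholder value 2, are illustrative guesses and not measurements of the driver.

```python
def blitter_relocations(num_rects):
    """One blit command per rectangle, each needing its own relocation
    (as described in the comment above)."""
    return num_rects


def render_relocations(num_rects, relocs_per_state_setup=2):
    """ASSUMPTION: a 3D-style path emits its target state once per batch and
    then only appends vertices, so the count does not grow with num_rects.
    The default of 2 is a made-up placeholder, not a measured figure."""
    return relocs_per_state_setup if num_rects else 0


for n in (1, 100, 5000):
    print(f"{n} rects: blitter={blitter_relocations(n)}, "
          f"3D (assumed)={render_relocations(n)}")
```

Under this model the blitter path's kernel-side relocation work grows linearly with the number of rectangles in a batch, while the assumed 3D path stays flat, which is the trade-off being weighed here.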
Comment 9 Chris Wilson 2009-12-02 03:13:20 UTC
Though after reading the drm code once more, the current profile indicates that a substantial amount of time is wasted by batch management. Looks like we could recover around 25% of the performance on this synthetic benchmark [for a fastish g45] with a switch to a cairo-drm style batch manager. [Though I still argue that if the goal is polygon rasterisation, the benchmark is inadequate. ;-]
Comment 10 Chris Wilson 2010-03-26 03:19:57 UTC
miCompositeRects(), need I say more?

Before:
11997.466216 Ops/s; rects (!); 15x15
4941.396509 Ops/s; rects (!); 75x75
1660.617060 Ops/s; rects (!); 250x250
2016.944734 Ops/s; rects blended; 15x15
555.278174 Ops/s; rects blended; 75x75
177.135678 Ops/s; rects blended; 250x250
12453.686200 Ops/s; rects composition (!); 15x15
4770.370370 Ops/s; rects composition (!); 75x75
1712.837838 Ops/s; rects composition (!); 250x250

After tweaking:
39378.768844 Ops/s; rects (!); 15x15
12316.143498 Ops/s; rects (!); 75x75
3965.307203 Ops/s; rects (!); 250x250
37158.724832 Ops/s; rects blended; 15x15
13607.142857 Ops/s; rects blended; 75x75
4006.815703 Ops/s; rects blended; 250x250
17146.502836 Ops/s; rects composition (!); 15x15
6820.754717 Ops/s; rects composition (!); 75x75
2014.409222 Ops/s; rects composition (!); 250x250

Hmm.
Comment 11 Chris Wilson 2010-05-12 05:07:00 UTC
commit cb887cfc670bf63993bd313ff33927afb8198eae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 26 09:59:51 2010 +0000

    uxa: solid rects
    
    The cost of performing relocations outweigh the advantages of using the
    blitter for solids with lots of rectangles.
    
    References:
    
      Bug 22127 - [UXA] 50% performance regression for XRenderFillRectangles
      https://bugs.freedesktop.org/show_bug.cgi?id=22127
    
    By using the 3D pipeline we improve our performance by around 4x on
    i945, measured by the jxbench microbenchmark, and a factor of 10x by
    short-cutting to the 3D pipeline for blended rectangles.
    
    Before, on a i945GME:
      19982.412060 Ops/s; rects (!); 15x15
      9599.131693 Ops/s; rects (!); 75x75
      3803.654743 Ops/s; rects (!); 250x250
      6836.743772 Ops/s; rects blended; 15x15
      1443.750000 Ops/s; rects blended; 75x75
      495.335821 Ops/s; rects blended; 250x250
      23247.933884 Ops/s; rects composition (!); 15x15
      10993.073048 Ops/s; rects composition (!); 75x75
      3595.905172 Ops/s; rects composition (!); 250x250
    
    After:
      87271.145975 Ops/s; rects (!); 15x15
      32347.744361 Ops/s; rects (!); 75x75
      5884.177215 Ops/s; rects (!); 250x250
      73500.000000 Ops/s; rects blended; 15x15
      33580.882353 Ops/s; rects blended; 75x75
      5858.811749 Ops/s; rects blended; 250x250
      25582.317073 Ops/s; rects composition (!); 15x15
      6664.728682 Ops/s; rects composition (!); 75x75
      14965.909091 Ops/s; rects composition (!); 250x250 [suspicious]
    
    This has no impact on Cairo, but I have a suspicion from watching xtrace
    that Qt likes to blit thousands of 1x1 rectangles with the same colour.
    However, we are still around 2-3x slower than the reported figures for
    EXA!
    
That's about as fast as I can make it for the time being. Not sure where to go next...
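
For reference, the "around 4x" and "10x" figures can be reproduced from the 15x15 rows of the commit message. This is just arithmetic on the quoted numbers; the variable names are invented for the example.

```python
# (before, after) in Ops/s on the i945GME, 15x15 rows of the commit message.
results_15x15 = {
    "rects": (19982.412060, 87271.145975),
    "rects blended": (6836.743772, 73500.000000),
}

speedups = {name: after / before
            for name, (before, after) in results_15x15.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

The solid case comes out a little above 4x and the blended case a little above 10x, consistent with the commit message.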
Comment 12 Chris Wilson 2010-05-12 05:09:41 UTC
Clemens, do you remember on which system you originally benchmarked? I'm trying to understand whether UXA performance is really 2-3x lower than EXA.
Comment 13 Clemens Eisserer 2010-05-12 05:19:19 UTC
Hi Chris,

Thanks a lot for still working on that stuff.
Originally I benchmarked on my i945GM based laptop with Fedora 8 running intel-2.2.1+EXA+Xorg-1.3; unfortunately I deleted that installation some time ago.

Running 2.6.32.11 + intel-2.11 I currently get:
28354.237316 Ops/s; rects (!); 15x15
16570.028991 Ops/s; rects (!); 75x75
4872.900679 Ops/s; rects (!); 250x250

Yes, unfortunately Qt's X11 backend is in a very bad shape ... doing things much worse than misusing rects for shapes ;)
Comment 14 Chris Wilson 2010-05-12 09:41:23 UTC
(In reply to comment #13)
> Hi Chris,
> 
> Thanks a lot for still working on that stuff.
> Originally I benchmarked on my i945GM based laptop with Fedora 8 running
> intel-2.2.1+EXA+Xorg-1.3; unfortunately I deleted that installation some time
> ago.

That's actually a good tip for building a retro system, thanks. :)
Comment 15 Clemens Eisserer 2010-06-16 03:45:09 UTC
With 2.6.34 + intel-2.11.901 I get:

176116.033723 Ops/s; rects (!); 15x15
63137.245405 Ops/s; rects (!); 75x75
6650.071638 Ops/s; rects (!); 250x250

Actually even better than the results I reported for EXA :)

Thanks a lot, Clemens
Comment 16 Chris Wilson 2010-06-26 03:57:49 UTC
Clemens, I think this is about as good as it gets. I've a few more general tweaks that will improve performance further, but I am not going to try and specifically improve this case.

Please do call my attention to any area you think is still underperforming [on i915, for i965 please wait until I've at least had a chance to fix it first]...and yes I've already fixed the code paths drawn to my attention via Phoronix! ;-)

Marking as fixed since performance seems on a par with EXA.