Created attachment 26501: xorg log

The XRenderFillRectangles test of JXRenderMark shows a 50% performance regression when running on UXA+KMS compared to plain EXA.
The test simulates scanline-based rendering using rectangles.

UXA:
93812.748533 Ops/s; rects (!); 15x15
23241.206030 Ops/s; rects (!); 75x75
7683.601740 Ops/s; rects (!); 250x250

EXA:
155345.525653 Ops/s; rects (!); 15x15
45269.111764 Ops/s; rects (!); 75x75
15016.181230 Ops/s; rects (!); 250x250

The test can be obtained from:
http://78.31.67.79:8080/jxrender/RenderMark.html
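For reference, the size of the regression can be worked out directly from the figures above. This is only a small arithmetic sketch; the helper name `regression_percent` is made up for illustration and is not part of JXRenderMark.

```python
# Throughput figures (Ops/s) copied from the report above.
uxa = {"15x15": 93812.748533, "75x75": 23241.206030, "250x250": 7683.601740}
exa = {"15x15": 155345.525653, "75x75": 45269.111764, "250x250": 15016.181230}


def regression_percent(old, new):
    """Percentage of throughput lost when going from `old` to `new`."""
    return (1.0 - new / old) * 100.0


# Hypothetical helper applied per rectangle size.
drops = {size: regression_percent(exa[size], uxa[size]) for size in exa}

for size, drop in drops.items():
    print(f"{size:>8}: UXA is {drop:.1f}% slower than EXA")
```

With these numbers the drop is roughly 40% for the 15x15 case and close to 49% for the two larger sizes.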
To add something to this bug (which won't be looked at anyway):
This is exactly what Java does for aliased shapes, so it's not only a synthetic benchmark. The JXRenderTest was designed to test paths used by the Java2D XRender backend.
(In reply to comment #1)
> To add something to this bug (which won't be looked at anyway):

Hi Clemens,

I'm not sure what the point is in suggesting bugs won't get looked at. I am sorry if you feel like you're being ignored for some reason, but we really do care about bugs and do want to fix them. Please chalk things up to us having more to do than we can, as opposed to any kind of intentional negligence.

> This is exactly what java does for aliased shapes, so it's not only a synthetic
> benchmark. The JXRenderTest was designed to test paths used by the Java2d
> XRender backend.

I wonder if the 50% drop you're getting here could be related to the 50% difference in performance I'm getting with cairo-image trapezoid rendering compared to cairo-xlib+UXA trapezoid rendering (even after I changed UXA to directly use pixman to render to a system-memory buffer). Or it could just be coincidental.

Anyway, you've mentioned that you can see the performance change when switching from EXA to UXA. Which version of the driver (and kernel) were you using when you did this test?

Thanks,
-Carl
Hi Carl,

This is on kernel-2.6.30-6.fc12.i586 and intel-2.7.0, both deployed by Fedora rawhide.
Yes, I saw the difference between EXA and UXA+KMS on the same system.
I got curious because Fedora 8 with intel-2.2.1+EXA+Xorg-1.3 performed much better in this test than 2.7.0+UXA+KMS on Fedora 11.

Sorry for my unfriendly response; it's just that bugs 16917 and 17933 were filed almost a year ago and still make me suffer. In early 2010 the XRender Java2D backend will most likely be deployed on a large scale, and people using Java applications will experience artifacts, not just in corner cases.

And now with UXA, this bug and 18075 make performance suffer too.
Eric had some suggestions to improve performance for 18075, but that didn't work out too well. Unfortunately this is the way Java2D does all its antialiasing, and it works well on OpenGL and Direct3D and is at least acceptable with EXA/XAA-based drivers.
Eric mentioned that it would help to XPutImage into malloc'ed memory for busy buffers; I had a look at the DRM code but failed to do it myself.
Hi Carl,

With 2.8 on 2.6.31-0.28.rc1.fc12.i586 I get even worse results, about half the throughput I got with 2.7.0.

50904.787836 Ops/s; rects (!); 15x15
12014.496179 Ops/s; rects (!); 75x75
3657.911342 Ops/s; rects (!); 250x250
34222.675803 Ops/s; rects composition (!); 15x15
10539.501111 Ops/s; rects composition (!); 75x75
3444.540728 Ops/s; rects composition (!); 250x250
7597.895967 Ops/s; put composition (!); 15x15
5656.032729 Ops/s; put composition (!); 75x75
1443.134083 Ops/s; put composition (!); 250x250

This could be because Fedora enabled some debug options in recent 2.6.31 kernels, as I can see that most of the time is spent in kernel code.
Most likely syslog is being flooded with messages, so logging is to blame.
I can now confirm that with 2.8 on rawhide the results are even worse than with 2.7.0 (the patched one deployed with Fedora 11), even with kmemleak disabled:
62133.192307 Ops/s; rects (!); 15x15

This is way less than what I got with EXA/2.7:
155345.525653 Ops/s; rects (!); 15x15

Please have a look at it if you have some time left.
Following the discussion on Intel-gfx, I benchmarked my 5-year-old laptop (P4 2.6 GHz, GeForce2 Go (low-cost mobile GPU) + proprietary legacy drivers):

202196.948344 Ops/s; rects (!); 15x15
52850.677392 Ops/s; rects (!); 75x75
11221.858765 Ops/s; rects (!); 250x250

This 5-year-old machine yields better results than the Core2Duo machine with the Intel GPU that the submitter benchmarked on.
Could you please run the benchmark in-house on 2.7+EXA+UMS vs. 2.8+UXA+KMS?
It should take no more than 5 minutes.
The regression here is entirely due to the extra cost of relocations with UXA vs EXA (because the memory manager now lies in the kernel, we cannot simply emit absolute offsets). Because FillRectangles is accelerated using the 2D blitter, which requires a relocation per command, this benchmark now requires thousands of relocations per batch buffer. Quite the increase in overhead - and this overhead has only increased over time as we add more paranoid command checking to the kernel.

One method we could use to avoid the extra relocation overhead is to switch to using the 3D unit for these operations - though that imposes its own set of restrictions.

In short, I think continuing to optimise this code path is a dead end - the fillRects benchmark itself does not look representative of the task that I believe it is trying to replicate (scan-line rasterisation of polygons), simply because using the constant color is already hitting special code paths through the client stack. And I have alternate ideas on how scan-line rasterisation should be accelerated, relegating the use of XRenderFillRectangles still further.
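To illustrate why scan-line rendering through the blitter generates so many relocations, here is a minimal Python sketch. It is not driver or JXRenderMark code: `circle_spans` and `relocations_for` are hypothetical helpers, the radius and shape count are arbitrary, and the count assumes each rectangle is emitted as its own blitter command with one relocation, as described above.

```python
import math


def circle_spans(cx, cy, r):
    """Return one 1-pixel-high rectangle (x, y, width, height) per
    scanline of a filled circle with centre (cx, cy) and radius r."""
    rects = []
    for y in range(cy - r, cy + r + 1):
        dy = y - cy
        # Half-width of the horizontal span covered on this scanline.
        half = math.isqrt(r * r - dy * dy)
        rects.append((cx - half, y, 2 * half + 1, 1))
    return rects


def relocations_for(rects, relocs_per_command=1):
    # Assumption: every rectangle becomes its own blitter command and
    # every command needs `relocs_per_command` relocations.
    return len(rects) * relocs_per_command


# Arbitrary illustrative workload: 1000 filled circles of radius 100.
one_shape = circle_spans(200, 200, 100)
total = 1000 * relocations_for(one_shape)
print(f"{len(one_shape)} rectangles per shape, {total} relocations in total")
```

Under these assumptions the relocation count grows linearly with the number of scanlines drawn, which is why a per-command relocation cost dominates for this kind of workload.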
Though after reading the drm code once more, the current profile indicates that a substantial amount of time is wasted by batch management. Looks like we could recover around 25% of the performance on this synthetic benchmark [for a fastish g45] with a switch to a cairo-drm style batch manager. [Though I still argue that if the goal is polygon rasterisation, the benchmark is inadequate. ;-]
miCompositeRects(), need I say more?

Before:
11997.466216 Ops/s; rects (!); 15x15
4941.396509 Ops/s; rects (!); 75x75
1660.617060 Ops/s; rects (!); 250x250
2016.944734 Ops/s; rects blended; 15x15
555.278174 Ops/s; rects blended; 75x75
177.135678 Ops/s; rects blended; 250x250
12453.686200 Ops/s; rects composition (!); 15x15
4770.370370 Ops/s; rects composition (!); 75x75
1712.837838 Ops/s; rects composition (!); 250x250

After tweaking:
39378.768844 Ops/s; rects (!); 15x15
12316.143498 Ops/s; rects (!); 75x75
3965.307203 Ops/s; rects (!); 250x250
37158.724832 Ops/s; rects blended; 15x15
13607.142857 Ops/s; rects blended; 75x75
4006.815703 Ops/s; rects blended; 250x250
17146.502836 Ops/s; rects composition (!); 15x15
6820.754717 Ops/s; rects composition (!); 75x75
2014.409222 Ops/s; rects composition (!); 250x250

Hmm.
commit cb887cfc670bf63993bd313ff33927afb8198eae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 26 09:59:51 2010 +0000

    uxa: solid rects

    The cost of performing relocations outweigh the advantages of using the
    blitter for solids with lots of rectangles.

    References:

      Bug 22127 - [UXA] 50% performance regression for XRenderFillRectangles
      https://bugs.freedesktop.org/show_bug.cgi?id=22127

    By using the 3D pipeline we improve our performance by around 4x on
    i945, measured by the jxbench microbenchmark, and a factor of 10x by
    short-cutting to the 3D pipeline for blended rectangles.

    Before, on a i945GME:
      19982.412060 Ops/s; rects (!); 15x15
      9599.131693 Ops/s; rects (!); 75x75
      3803.654743 Ops/s; rects (!); 250x250
      6836.743772 Ops/s; rects blended; 15x15
      1443.750000 Ops/s; rects blended; 75x75
      495.335821 Ops/s; rects blended; 250x250
      23247.933884 Ops/s; rects composition (!); 15x15
      10993.073048 Ops/s; rects composition (!); 75x75
      3595.905172 Ops/s; rects composition (!); 250x250

    After:
      87271.145975 Ops/s; rects (!); 15x15
      32347.744361 Ops/s; rects (!); 75x75
      5884.177215 Ops/s; rects (!); 250x250
      73500.000000 Ops/s; rects blended; 15x15
      33580.882353 Ops/s; rects blended; 75x75
      5858.811749 Ops/s; rects blended; 250x250
      25582.317073 Ops/s; rects composition (!); 15x15
      6664.728682 Ops/s; rects composition (!); 75x75
      14965.909091 Ops/s; rects composition (!); 250x250 [suspicious]

    This has no impact on Cairo, but I have a suspicion from watching xtrace
    that Qt likes to blit thousands of 1x1 rectangles with the same colour.

However, we are still around 2-3x slower than the reported figures for EXA! That's about as fast as I can make it for the time being. Not sure where to go next...
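As a cross-check of the "around 4x" and "10x" figures in the commit message, the per-test speedups can be computed from the i945GME numbers quoted there. This is only an arithmetic sketch; the dictionary layout and variable names are chosen here for illustration.

```python
# Ops/s figures copied from the commit message, ordered 15x15, 75x75, 250x250.
sizes = ("15x15", "75x75", "250x250")
before = {
    "rects": (19982.412060, 9599.131693, 3803.654743),
    "rects blended": (6836.743772, 1443.750000, 495.335821),
    "rects composition": (23247.933884, 10993.073048, 3595.905172),
}
after = {
    "rects": (87271.145975, 32347.744361, 5884.177215),
    "rects blended": (73500.000000, 33580.882353, 5858.811749),
    "rects composition": (25582.317073, 6664.728682, 14965.909091),
}

# Speedup = throughput after the change divided by throughput before it.
speedup = {
    test: [a / b for a, b in zip(after[test], before[test])]
    for test in before
}

for test, ratios in speedup.items():
    line = ", ".join(f"{s}: {r:.2f}x" for s, r in zip(sizes, ratios))
    print(f"{test}: {line}")
```

For the 15x15 case this gives a little over 4x for plain rects and a little under 11x for blended rects. The composition row is uneven: the 75x75 figure drops below 1x while the 250x250 figure jumps, which fits the [suspicious] marker on that result.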
Clemens, do you remember on which system you originally benchmarked? I'm trying to understand whether UXA performance is really 2-3x lower than EXA.
Hi Chris,

Thanks a lot for still working on that stuff.
Originally I benchmarked on my i945GM-based laptop with Fedora 8 running intel-2.2.1+EXA+Xorg-1.3; unfortunately I deleted that installation some time ago.

Running 2.6.32.11 + intel-2.11 I currently get:
28354.237316 Ops/s; rects (!); 15x15
16570.028991 Ops/s; rects (!); 75x75
4872.900679 Ops/s; rects (!); 250x250

Yes, unfortunately Qt's X11 backend is in very bad shape ... doing things much worse than misusing rects for shapes ;)
(In reply to comment #13)
> Hi Chris,
>
> Thanks a lot for still working on that stuff.
> Originally I benchmarked on my i945GM based laptop with Fedora-8 running
> intel-2.2.1+EXA+Xorg-1.3, unfortunately I deleted that installation some time
> ago.

That's actually a good tip for building a retro system, thanks. :)
With 2.6.34 + intel-2.11.901 I get:
176116.033723 Ops/s; rects (!); 15x15
63137.245405 Ops/s; rects (!); 75x75
6650.071638 Ops/s; rects (!); 250x250

Actually even better than the results I reported for EXA :)

Thanks a lot, Clemens
Clemens, I think this is about as good as it gets. I have a few more general tweaks that will improve performance further, but I am not going to try to improve this case specifically. Please do call my attention to any area you think is still underperforming [on i915; for i965 please wait until I've at least had a chance to fix it first]... and yes, I've already fixed the code paths drawn to my attention via Phoronix! ;-)

Marking as fixed since performance seems on a par with EXA.