While running with a non-compositing window manager (metacity), I can easily get 200k glyphs/sec. When running compiz, this number drops to 65k glyphs/sec. If I understand the role of a compositor correctly, there should be effectively no impact on plain X rendering performance. This is probably what makes my composited desktop feel distinctly jerky at times.
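To put the two figures on a common scale, here is a small Python sketch that turns the non-composited and composited glyph rates into a fractional slowdown. The helper name `throughput_drop` is hypothetical and not part of x11perf; the inputs are simply the numbers quoted above.

```python
def throughput_drop(baseline: float, composited: float) -> float:
    """Fraction of baseline throughput lost when compositing (0.0 = no loss)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return 1.0 - composited / baseline


# Figures from the report above, in thousands of glyphs/sec.
metacity = 200.0
compiz = 65.0

drop = throughput_drop(metacity, compiz)
print(f"compiz runs at {compiz / metacity:.1%} of metacity's rate ({drop:.1%} slower)")
```

With these numbers compiz delivers about a third of metacity's rate, i.e. a drop of roughly two thirds, which is far more than a single extra copy per update would suggest.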
Created attachment 19642: Xorg log

When running x11perf -aa10text with compiz, Xorg takes 80%+ CPU time. I'll try doing some profiling.
Created attachment 19643: Profile while running compiz and x11perf

Taken with:
  $ opcontrol --init
  $ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.27-3.fc10.x86_64/vmlinux
  $ opcontrol -c 15
  $ opcontrol --start
  $ x11perf -aa10text
  $ x11perf -aa10text
  $ opcontrol --stop
  $ opreport -c%
After taking another profile from a double run of x11perf -aa10text, this time with:
  $ opcontrol --reset
  $ opcontrol -i Xorg
  $ opcontrol --start
  $ x11perf -aa10text
  $ x11perf -aa10text
  $ opcontrol --stop
  $ opreport -c -t 10 > hi
it became stunningly clear that lots of time is being spent in free_memtype and reserve_memtype:

  CPU: Core 2, speed 800 MHz (estimated)
  Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
  samples  %        app name  symbol name
  -------------------------------------------------------------------------------
    218716 98.3931  vmlinux   iounmap
  215649   21.7055  vmlinux   free_memtype
    215649 97.0020  vmlinux   free_memtype [self]
  -------------------------------------------------------------------------------
    217416 98.6743  vmlinux   __ioremap_caller
  214371   21.5768  vmlinux   reserve_memtype
    214371 97.2491  vmlinux   reserve_memtype [self]
  -------------------------------------------------------------------------------
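In `opreport -c` output, each block between dashed lines lists the callers (indented, above), the function being described (unindented), and its callees or self time (indented, below). The rough Python sketch below pulls out the unindented primary entries and adds up their share of the samples. It assumes the report has already been trimmed to just the call-graph blocks, with the common leading indentation removed (the CPU and column-header lines would need to be skipped first); `primary_entries` is a hypothetical helper, not part of oprofile.

```python
from typing import List, Tuple

# Call-graph blocks from the profile above, without the header lines.
SAMPLE = """\
  218716 98.3931  vmlinux   iounmap
215649   21.7055  vmlinux   free_memtype
  215649 97.0020  vmlinux   free_memtype [self]
-------------------------------------------------------------------------------
  217416 98.6743  vmlinux   __ioremap_caller
214371   21.5768  vmlinux   reserve_memtype
  214371 97.2491  vmlinux   reserve_memtype [self]
"""


def primary_entries(report: str) -> List[Tuple[int, float, str, str]]:
    """Return (samples, percent, app, symbol) for each unindented line.

    Callers and callees are indented in opreport's call-graph output, so the
    unindented line in each block is the function the block describes.
    """
    entries = []
    for line in report.splitlines():
        if not line.strip() or line.startswith("-") or line[0].isspace():
            continue  # blank line, block separator, caller or callee
        samples, percent, app, symbol = line.split(maxsplit=3)
        entries.append((int(samples), float(percent), app, symbol))
    return entries


entries = primary_entries(SAMPLE)
total = sum(pct for _, pct, _, _ in entries)
for samples, pct, app, symbol in entries:
    print(f"{symbol:<16} {pct:6.2f}%  ({samples} samples)")
print(f"combined: {total:.2f}% of the samples in this report")
```

On this excerpt the two memtype functions together account for about 43% of the samples, and the `[self]` lines show that nearly all of that is spent inside the functions themselves rather than in their callees.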
Created attachment 19646: Profile while running x11perf without compiz

Here is a profile of x11perf -aa10text running without compiz in a clean Xorg session. CPU usage is still high in this case (~60-70%), but performance is much better (200k glyphs/sec, as opposed to 65k with compiz, as stated earlier). Moreover, most of the samples in this profile are expected, with dixLookupPrivate showing up first and exaBufferGlyph a close third. Taken with:
  $ opcontrol --init
  $ opcontrol -i any
  $ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.27-3.fc10.x86_64/vmlinux
  $ opcontrol -c 15
  $ opcontrol --start
  $ x11perf -aa10text
  $ x11perf -aa10text
  $ opcontrol --stop
  $ opreport -%c
Hi Ben,

Thanks for the bug report. I see from your X log (and your oprofile command line) that you appear to be running with the following:

  GPU: GM965
  X server: 1.5.2
  intel_drv: 2.4.97
  Linux: 2.6.27-3.fc10.x86_64

That's pretty similar to my normal setup, except that I also have a GEM-enabled kernel and an X server from the master branch (but I can also switch out either of those easily enough). I don't generally run with compiz, so I'll try that and report back on whether I see the same bug.

-Carl
I believe the stock Rawhide kernel is also GEM enabled. Is there any indication of this in xorg.log? I'd definitely be happy to do any further profiling or debugging. Just let me know what I can do to help.
Created attachment 19671: Profile while switching through 8 open windows with compiz's application switcher plugin

Out of curiosity, I took another profile of a case that has also been quite slow with compiz: rotating through the window list of the application switcher plugin (i.e. Alt+Tab) with a moderate number of windows open (5 to 10; the impact is far greater if a few of them are Firefox windows). In this case, the individual frames of the "fade" from transparent to opaque as each new window is rotated in can be clearly seen (probably tenths of a second between frames). This profile looks very similar to the ones I attached earlier in its heavy use of *_memtype. Looks like there is one heck of a bottleneck there.
Just to clarify, the above profile was taken with:
  $ opcontrol --reset
  $ opcontrol --start
  (press Alt+Tab repeatedly for a few minutes)
  $ opcontrol --stop
  $ opreport -%c
(In reply to comment #6)
> I believe the stock Rawhide kernel is also GEM enabled.

Interesting. I don't know the details of what the Rawhide kernel has. The recommended kernel to use for GEM with the Intel drivers is the drm-intel-next kernel. It is available as the drm-intel-next branch of the following git repository:

  git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel

Meanwhile, the free_memtype and reserve_memtype bottlenecks in the profile suggest that your kernel was built without high-mem (you'll want CONFIG_HIGHMEM=y and CONFIG_HIGHMEM4G=y). So please rebuild your kernel with those options and report back (or else complain to the supplier of your kernel).

Thanks,

-Carl
(In reply to comment #9)
> The recommended kernel to use for GEM with the Intel drivers is the
> drm-intel-next kernel.
>
> Meanwhile, the free_memtype and reserve_memtype bottlenecks in the profile
> suggest that your kernel was built without high-mem (you'll want
> CONFIG_HIGHMEM=y and CONFIG_HIGHMEM4G=y).

I'll give the branch you cited a try after my class in an hour or so. However, I looked at the config of my kernel and was unable to find any mention of CONFIG_HIGHMEM:

  $ cat /boot/config-`uname -r` | grep HIGHMEM
  $

I found this quite odd (I know the option used to be under "Processor Types and Features" in menuconfig), so I downloaded the stock 2.6.27.1 tarball and looked for it there. Once again, I was unable to find it in any Kconfig, although plenty of #ifdefs showed up. The fact that I'm on x86-64 is probably important to mention here (sorry about not making that clearer), so I can definitely see how this option would be irrelevant (although I admittedly don't know a whole lot about what the option itself does).
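An empty grep like the one above is ambiguous: a kernel `.config` records a disabled option as `# CONFIG_FOO is not set`, while an option that the architecture does not offer at all (as with the high-mem options on x86-64, which only exist for 32-bit x86) does not appear in the file at all. The Python sketch below, using a hypothetical helper `kconfig_state`, distinguishes the two cases; the `/boot/config-<release>` path mirrors the command above and is only read if it exists.

```python
import platform
import re
from pathlib import Path


def kconfig_state(config_text: str, option: str):
    """Report how `option` appears in a kernel .config file.

    Returns the assigned value as written (e.g. "y", "m", "64"),
    "not set" for a "# CONFIG_FOO is not set" line, or None if the
    option does not appear at all (e.g. it does not exist on this arch).
    """
    assigned = re.compile(rf"^{re.escape(option)}=(.*)$")
    unset = re.compile(rf"^# {re.escape(option)} is not set$")
    for line in config_text.splitlines():
        match = assigned.match(line)
        if match:
            return match.group(1)
        if unset.match(line):
            return "not set"
    return None


if __name__ == "__main__":
    path = Path("/boot/config-" + platform.release())
    if path.exists():
        text = path.read_text()
        for opt in ("CONFIG_HIGHMEM", "CONFIG_HIGHMEM4G", "CONFIG_X86_64"):
            print(opt, "->", kconfig_state(text, opt))
```

On an x86-64 kernel one would expect `None` for both high-mem options, meaning "not offered by this architecture" rather than "deliberately disabled".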
(In reply to comment #10)
> The fact that I'm on x86-64 is probably important to mention here (sorry about
> not making that clearer), so I can definitely see how this option would be
> irrelevant (although I admittedly don't know a whole lot about what the option
> itself does).

Ah, yes.

It's not at all surprising that the option isn't present on x86-64. So we might just be doing the wrong thing on such systems. We'll have to do some investigation and see what we should be doing differently. Thank you very much for the bug report.

-Carl
(In reply to comment #11)
> It's not at all surprising that the option isn't present on x86-64. So we might
> just be doing the wrong thing on such systems. We'll have to do some
> investigation and see what we should be doing differently.

No worries, let me know when I can test something.
I'm assuming the recent I/O mapping work I've been reading about on dri-devel is germane here. Has it been submitted to Linus yet? Is it in a testable state? What is the plan for testing it? I'll probably have to pull a 2.6.28 prerelease kernel, right?
Things are much better now, although I don't think we're entirely free of a compositing performance penalty. While compiz performance is dramatically better (usable as my primary window manager for the first time since its initial release), I'm still only getting 105k glyphs/sec at best in -aa10text. For this reason (I think), Firefox scrolling is still a bit sluggish. Is this expected? Are there still optimizations to be done?
Created attachment 20168: Profile of x11perf -aa10text with new kernel bits

For comparison's sake, here is a profile of x11perf -aa10text running under compiz with the latest Rawhide kernel (which incorporates the new memory-mapping kernel bits). I did 3 runs of -aa10text, each of which averaged about 90k glyphs/sec.
Created attachment 20169: Profile of scrolling in Firefox with new kernel bits

Here is an attempt at getting a profile of scrolling in Firefox. While scrolling this bug's page, Firefox used nearly 20% of a core. Unfortunately, looking at this profile, I see no instances of firefox. Regardless, I'm attaching it anyway in the hope that it will help someone.
(In reply to comment #14)
> Things are much better now, although I don't think we're entirely free of a
> compositing performance penalty.

FWIW, it's unrealistic to expect that, as compositing does incur at least one additional copy for making updates visible on the screen.
(In reply to comment #17)
> FWIW, it's unrealistic to expect that, as compositing does incur at least one
> additional copy for making updates visible on the screen.

True, but I meant "free" pretty loosely. I was under the impression that this copy (texture mapping) happens on the 3D unit, whereas text rendering is a 2D operation (on dedicated hardware). Is this incorrect? Is the chip bandwidth-starved? I'm getting 150k glyphs/sec without compiz in -aa10text, which means that compiz costs 30% of text rendering performance. This shows up as much smoother scrolling in Firefox when not composited. Regardless, thanks a ton for your work.
An update: with xf86-video-intel from git and a moderately recent kernel (with GEM, of course), I now get ~240k glyphs/sec with metacity and 205k glyphs/sec with compiz. I'll do another set of profiles if needed.
Just an update: today I re-ran aa10text with both metacity and compiz and found the following:

  metacity: 235 kglyphs/sec (Woo hoo!)
  compiz:   180 kglyphs/sec (Doh!)

Seems like we've regressed a little since November. This is with:

  $ xorg-versions.sh
  Xorg components as of Fri Apr 17 11:35:26 EDT 2009
  drm: 1173e7abdcdf758a2403ce921076080c6672c054
  xf86-video-intel: ebb8d6a13a18138b31ad119be2d3807a1e4010b3
  mesa: e704de8cb62092f7402cfe99064fcd692e492086
  xserver: 49bd35c28245d2261f17887f03a23deddf57d1e9

  Linux mercury.localdomain 2.6.29work #34 SMP PREEMPT Thu Apr 16 12:25:16 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
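Expressing the earlier update (~240k with metacity, 205k with compiz) and the April 2009 run as relative overhead makes the regression clearer: the metacity figure barely moved, so nearly all of the change is on the compiz side. A small Python sketch; the helper name `overhead` is hypothetical.

```python
def overhead(baseline: float, composited: float) -> float:
    """Fractional slowdown of the composited run relative to the baseline."""
    return 1.0 - composited / baseline


# (label, metacity kglyphs/sec, compiz kglyphs/sec) as reported in this bug.
runs = [
    ("earlier update", 240.0, 205.0),
    ("17 Apr 2009", 235.0, 180.0),
]

for label, metacity, compiz in runs:
    print(f"{label:>15}: compiz overhead {overhead(metacity, compiz):.1%}")
```

This puts the compiz overhead at roughly 15% in the earlier update and roughly 23% in April, so the "regressed a little" is mostly a compositing regression rather than a general slowdown.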
*** Bug 21681 has been marked as a duplicate of this bug. ***
Hi Ben, Something I'd like to do now is to redo some testing of the overhead imposed by compiz, but with a real-world test case like cairo-perf-trace rather than just "x11perf -aa10text". It occurs to me that we may need to tweak cairo-perf-trace slightly to actually force it to go through paths that involve the compositing manager. Should be interesting stuff. -Carl
A week or two ago I did some more investigation of this bug. I convinced Chris Wilson to make a change so that we can use:

  csi-replay --xlib

to replay a trace to a window (rather than just to an offscreen pixmap). With that, I was able to run a Firefox trace and measure the overhead of running with compiz. I neglected to record the numbers I obtained from that testing (though I should be able to get them again easily), but the question becomes: how small should we expect the overhead to be before we can consider it adequate?

-Carl
I'm lowering the priority of this bug report since much of the performance regression when compositing was eliminated long ago. I'll leave the bug report open in case anybody wants to carefully look at and characterize the remaining performance difference to decide whether it's expected or not. -Carl
I tested aa10text on GM45 32-bit, G45 64-bit and GM965 64-bit platforms and find there is still some regression (results per second):

1. G45 64-bit
   under X:               302000
   GNOME without compiz:  305000
   GNOME with compiz:     254000
2. GM45 32-bit
   under X:               256000
   GNOME without compiz:  252000
   GNOME with compiz:     145000
3. GM965 64-bit
   under X:               250000
   GNOME without compiz:  251000
   GNOME with compiz:     161000
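Since bare X and GNOME without compiz are within about 2% of each other on all three platforms, the meaningful comparison is GNOME with compiz against GNOME without it. The Python sketch below computes that overhead per platform from the numbers above; `compiz_overhead` is a hypothetical helper name.

```python
# platform: (under X, GNOME without compiz, GNOME with compiz), per second
results = {
    "G45 64-bit": (302000, 305000, 254000),
    "GM45 32-bit": (256000, 252000, 145000),
    "GM965 64-bit": (250000, 251000, 161000),
}


def compiz_overhead(no_compiz: float, with_compiz: float) -> float:
    """Fraction of throughput lost by enabling compiz."""
    return 1.0 - with_compiz / no_compiz


for platform_name, (bare_x, no_compiz, with_compiz) in results.items():
    print(f"{platform_name:<13} compiz overhead vs. plain GNOME: "
          f"{compiz_overhead(no_compiz, with_compiz):.1%}")
```

The overhead ranges from about 17% on G45 to about 36% on GM965 and 42% on GM45, so the remaining penalty varies considerably by chipset.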
To clarify, we do expect a compositing window manager to impact performance, since it has the role of "fixing" damaged regions on the screen. With compositing enabled, the app renders into a backing pixmap; X then sends damage events for that pixmap to the compositing manager, which decides how to repaint the damaged region of the screen from the backing pixmap as part of the composited desktop. However, what we do not expect is for this to run at only a third of the speed of the non-composited case. Currently, the impact of a compositing window manager is:

  non-composited:        879 kglyphs/s
  compiz:                452 kglyphs/s
  mutter (gnome-shell):  728 kglyphs/s

In this case, it would seem the residual bug lies within compiz (and I know from recent patches that mutter performance has improved further).
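As a toy model of the extra copy described above (not the X Damage extension API; `union_area` and the rectangles are hypothetical), suppose the compositor accumulates the damage rectangles reported for a window during one frame and repaints their union once. Under that assumption, the additional work scales with the damaged area rather than with the number of damage events, since overlapping damage is only copied once.

```python
from typing import Iterable, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def union_area(rects: Iterable[Rect]) -> int:
    """Area covered by the union of axis-aligned rectangles.

    Uses coordinate compression: split the plane at every rectangle edge
    and count each resulting cell once if any rectangle covers it.
    """
    rects = list(rects)
    xs = sorted({x for x, _, w, _ in rects} | {x + w for x, _, w, _ in rects})
    ys = sorted({y for _, y, _, h in rects} | {y + h for _, y, _, h in rects})
    area = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            covered = any(
                rx <= x0 and x1 <= rx + rw and ry <= y0 and y1 <= ry + rh
                for rx, ry, rw, rh in rects
            )
            if covered:
                area += (x1 - x0) * (y1 - y0)
    return area


# Hypothetical damage for one window during a single frame:
# two overlapping text runs.
frame_damage = [(0, 0, 10, 10), (5, 5, 10, 10)]
per_event_copy = sum(w * h for _, _, w, h in frame_damage)
batched_copy = union_area(frame_damage)
print(f"copy per event: {per_event_copy} px, copy of union: {batched_copy} px")
```

Here copying each damage event separately would move 200 pixels, while copying the union once moves 175; with many small overlapping glyph updates the gap between the two strategies grows.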
I tested on two Pinetrail machines, one with MeeGo 0.9 (which uses mutter) and the other with Fedora 12 (with compiz), and also on a Piketon. The overhead is about 40% on Pinetrail and about 13% on Piketon, which may be acceptable.

pinetrail with MeeGo:
  non-composited:        600000.0/sec
  mutter (gnome-shell):  373000.0/sec
pinetrail with F12:
  non-composited:        777000.0/sec
  compiz:                490000.0/sec
Piketon:
  non-composited:        349000.0/sec
  compiz:                302000.0/sec
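The overhead percentages can be checked directly against the raw numbers above with a short Python sketch (the `measurements` list simply restates those results):

```python
measurements = [
    # (machine, window manager, non-composited /sec, composited /sec)
    ("Pinetrail, MeeGo 0.9", "mutter", 600000.0, 373000.0),
    ("Pinetrail, Fedora 12", "compiz", 777000.0, 490000.0),
    ("Piketon", "compiz", 349000.0, 302000.0),
]

# Fraction of non-composited throughput lost under the compositing manager.
losses = {machine: 1.0 - composited / plain
          for machine, _, plain, composited in measurements}

for machine, wm, _, _ in measurements:
    print(f"{machine:<22} {wm:<7} overhead {losses[machine]:.1%}")
```

Both Pinetrail configurations land at roughly 37-38%, and Piketon at about 13.5%, so mutter and compiz show a similar penalty on the same hardware class.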
The 40% overhead case for Meego isn't good enough and is what prompted Robert Bragg to fix it. ;-) So we should see Meego improve, but it is difficult to quantify what the acceptable overhead is.