18011 – [x86-64] Slow performance with compositing (x11perf -aa10text with compiz)

Status:	VERIFIED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Carl Worth
QA Contact:	Xorg Project Team

Duplicates (1):	21681

Reported:	2008-10-10 15:51 UTC by Ben Gamari
Modified:	2010-06-01 10:55 UTC
CC List:	3 users

Attachments
Xorg log (83.90 KB, text/plain) 2008-10-13 22:35 UTC, Ben Gamari
Profile while running compiz and x11perf (230.31 KB, application/x-bz2) 2008-10-13 22:46 UTC, Ben Gamari
Profile while running x11perf without compiz (150.76 KB, application/x-bz2) 2008-10-13 23:17 UTC, Ben Gamari
Profile while switching through 8 open windows with compiz' application switcher plugin (226.35 KB, application/x-bz2) 2008-10-15 14:26 UTC, Ben Gamari
Profile of x11perf -aa10text with new kernel bits (298.16 KB, application/x-bz2) 2008-11-09 14:51 UTC, Ben Gamari
Profile of scrolling in firefox with new kernel bits (305.61 KB, application/x-bz2) 2008-11-09 15:15 UTC, Ben Gamari

Description Ben Gamari 2008-10-10 15:51:53 UTC

While running with a non-compositing window manager (metacity), I can easily get 200k glyphs/sec. When running compiz, this number drops to 65k glyphs/sec. If I understand the role of a compositor correctly, there should be effectively no impact on plain X render performance. It's probably this effect that gives my composited desktop a distinctly jerky feel at times.

Comment 1 Ben Gamari 2008-10-13 22:35:27 UTC

Created attachment 19642
Xorg log

When running x11perf -aa10text with compiz, Xorg takes 80%+ CPU time. I'll try doing some profiling.

Comment 2 Ben Gamari 2008-10-13 22:46:36 UTC

Created attachment 19643
Profile while running compiz and x11perf

Taken with,
$ opcontrol --init
$ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.27-3.fc10.x86_64/vmlinux
$ opcontrol -c 15
$ opcontrol --start
$ x11perf -aa10text
$ x11perf -aa10text
$ opcontrol --stop
$ opreport -c%

Comment 3 Ben Gamari 2008-10-13 22:57:08 UTC

After taking another profile from a double-run of x11perf -aa10text, this time with,

$ opcontrol --reset
$ opcontrol -i Xorg
$ opcontrol --start
$ x11perf -aa10text
$ x11perf -aa10text
$ opcontrol --stop
$ opreport -c -t 10 > hi

it became stunningly clear that lots of time is being spent in free/reserve_memtype:

CPU: Core 2, speed 800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name                 symbol name
-------------------------------------------------------------------------------
  218716   98.3931  vmlinux                  iounmap
215649   21.7055  vmlinux                  free_memtype
  215649   97.0020  vmlinux                  free_memtype [self]
-------------------------------------------------------------------------------
  217416   98.6743  vmlinux                  __ioremap_caller
214371   21.5768  vmlinux                  reserve_memtype
  214371   97.2491  vmlinux                  reserve_memtype [self]
-------------------------------------------------------------------------------

Comment 4 Ben Gamari 2008-10-13 23:17:04 UTC

Created attachment 19646
Profile while running x11perf without compiz

Here is a profile taken while running x11perf -aa10text without compiz in a clean Xorg session. CPU usage is still high in this case (~60-70%), but performance is much better (200k glyphs/sec, as opposed to 65k with compiz, as stated earlier). Moreover, most of the samples in this profile are expected, with dixLookupPrivate showing up first and exaBufferGlyph a close third.

$ opcontrol --init
$ opcontrol -i any
$ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.27-3.fc10.x86_64/vmlinux
$ opcontrol -c 15
$ opcontrol --start
$ x11perf -aa10text
$ x11perf -aa10text
$ opcontrol --stop
$ opreport -%c

Comment 5 Carl Worth 2008-10-14 09:34:34 UTC

Hi Ben,

Thanks for the bug report.

I see from your X log (and your oprofile command line) that you appear to be running with the following:

GPU:       GM965
X server:  1.5.2
intel_drv: 2.4.97
Linux:     2.6.27-3.fc10.x86_64

That's pretty similar to my normal setup except that I also have a GEM-enabled kernel and an X server from the master branch, (but I can also switch out either of those easily enough).

I don't generally run with compiz, so I'll try that and report back whether I see the same bug or not.

-Carl

Comment 6 Ben Gamari 2008-10-14 10:04:10 UTC

I believe the stock Rawhide kernel is also GEM enabled. Is there any indication of this in xorg.log? I'd definitely be happy to do any further profiling or debugging. Just let me know what I can do to help.

Comment 7 Ben Gamari 2008-10-15 14:26:41 UTC

Created attachment 19671
Profile while switching through 8 open windows with compiz' application switcher plugin

Out of curiosity, I took another profile of a case that has also been quite slow with compiz: rotating through the window list of the application switcher plugin (i.e. Alt+Tab) with a moderate number of windows open (5 to 10; the impact is far greater if a few are Firefox windows). In this case, the individual frames of the "fade" from transparent to opaque when a new window is rotated in can be clearly seen (probably tenths of a second between frames). This profile looks very similar to the profiles I attached earlier in its heavy use of *_memtype. Looks like there is one heck of a bottleneck there.

Comment 8 Ben Gamari 2008-10-15 14:29:07 UTC

Just to clarify, the above profile was taken with,
$ opcontrol --reset
$ opcontrol --start
Sit pressing Alt+Tab for a few minutes
$ opcontrol --stop
$ opreport -%c
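
The reset/start/workload/stop sequence used for the profiles in this report can be wrapped in a small helper. This is only a sketch: profile_run and the OPCONTROL override are hypothetical names introduced here, and it assumes the legacy opcontrol interface shown in the comments above, run with sufficient privileges.

```shell
# Hypothetical helper, not part of oprofile: wrap a workload in the
# opcontrol reset/start/stop sequence used in the comments above.
# OPCONTROL may be overridden (e.g. OPCONTROL=echo) for a dry run.
OPCONTROL=${OPCONTROL:-opcontrol}

profile_run() {
    "$OPCONTROL" --reset || return 1   # discard samples from earlier sessions
    "$OPCONTROL" --start || return 1   # begin collecting samples
    "$@"                               # run the workload, e.g. x11perf -aa10text
    status=$?
    "$OPCONTROL" --stop                # stop collecting even if the workload failed
    return "$status"
}
```

A run would then look like "profile_run x11perf -aa10text", followed by opreport as in the comments above to inspect the samples.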

Comment 9 Carl Worth 2008-10-16 10:57:06 UTC

(In reply to comment #6)
> I believe the stock Rawhide kernel is also GEM enabled.

Interesting. I don't know details about what the Rawhide kernel has. The recommended kernel to use for GEM with the Intel drivers is the drm-intel-next kernel. It is available from the following git repository:

git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel

as the drm-intel-next branch.

Meanwhile, the free_memtype and reserve_memtype bottlenecks in the profile suggest that your kernel was built without high-mem, (you'll want CONFIG_HIGHMEM=y and CONFIG_HIGHMEM4G=y).

So please rebuild your kernel with those options and report back, (or else complain to the supplier of your kernel).

Thanks,

-Carl

Comment 10 Ben Gamari 2008-10-16 11:39:26 UTC

(In reply to comment #9)
> (In reply to comment #6)
> > I believe the stock Rawhide kernel is also GEM enabled.
> 
> Interesting. I don't know details about what the Rawhide kernel has. The
> recommended kernel to use for GEM with the Intel drivers is the drm-intel-next
> kernel. It is available from the following git repository:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
> 
> as the drm-intel-next branch.
> 
> Meanwhile, the free_memtype and reserve_memtype bottlenecks in the profile
> suggest that your kernel was built without high-mem, (you'll want
> CONFIG_HIGHMEM=y and CONFIG_HIGHMEM4G=y).
> 
> So please rebuild your kernel with those options and report back, (or else
> complain to the supplier of your kernel).
> 
> Thanks,
> 
> -Carl
> 

I'll give the branch you cited a try after my class in an hour or so. However, I looked at the config of my kernel and was unable to find any mention of CONFIG_HIGHMEM,

$ cat /boot/config-`uname -r` | grep HIGHMEM
$

I found this quite odd (I know the option used to be under "Processor Types and Features" in menuconfig), so I downloaded the stock 2.6.27.1 tarball and tried looking for it. Once again, I was unable to find it in any Kconfig, although plenty of #ifdefs showed up.

The fact that I'm on x86-64 is probably important to mention here (sorry about not making that clearer), so I can definitely see how this option would be irrelevant (although I admittedly don't know a whole lot about what the option itself does).
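
On x86, CONFIG_HIGHMEM is offered only for 32-bit builds, so an empty grep on x86-64 is expected rather than a sign of a misconfigured kernel. A minimal sketch of a check that makes that distinction follows; highmem_status is a hypothetical helper name, and treating every architecture other than x86_64 the same way is a simplifying assumption.

```shell
# Hypothetical helper: report whether CONFIG_HIGHMEM is relevant and enabled.
# Usage: highmem_status ARCH CONFIG_FILE
highmem_status() {
    case "$1" in
        x86_64)
            # 64-bit x86 kernels have no HIGHMEM option at all
            echo "not applicable"
            ;;
        *)
            # Anchor the match so CONFIG_HIGHMEM4G=y alone does not count
            if grep -q '^CONFIG_HIGHMEM=y' "$2" 2>/dev/null; then
                echo "enabled"
            else
                echo "disabled"
            fi
            ;;
    esac
}
```

For the running kernel this could be called as: highmem_status "$(uname -m)" "/boot/config-$(uname -r)"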

Comment 11 Carl Worth 2008-10-16 11:48:05 UTC

(In reply to comment #10)
> The fact that I'm on x86-64 is probably important to mention here (sorry about
> not making that clearer), so I can definitely see how this option would be
> irrelevant (although I admittedly don't know a whole lot about what the option
> itself does).

Ah, yes.

It's not at all surprising that the option isn't present on x86-64. So we might just be doing the wrong thing on such systems. We'll have to do some investigation and see what we should be doing differently. Thank you very much for the bug report.

-Carl

Comment 12 Ben Gamari 2008-10-16 12:47:21 UTC

(In reply to comment #11)
> Ah, yes.
> 
> It's not at all surprising that the option isn't present on x86-64. So we might
> just be doing the wrong thing on such systems. We'll have to do some
> investigation and see what we should be doing differently. Thank you very much
> for the bug report.
> 
> -Carl
> 

No worries, let me know when I can test something.

Comment 13 Ben Gamari 2008-10-27 09:13:44 UTC

I'm assuming the recent io mapping work I've been reading about on dri-devel is germane here. Has this been submitted to Linus yet? Is it in a testable state? What is the plan for this as far as testing? I'll probably have to pull a 2.6.28 prerelease kernel, right?

Comment 14 Ben Gamari 2008-11-09 10:38:02 UTC

Things are much better now, although I don't think we're entirely free of a compositing performance penalty. While compiz performance is dramatically better (usable as my primary window manager for the first time since its initial release), I'm still only getting 105k glyphs/second at best in -aa10text. For this reason (I think), firefox scrolling is still a bit sluggish. Is this expected? Are there still optimizations to be done?

Comment 15 Ben Gamari 2008-11-09 14:51:06 UTC

Created attachment 20168
Profile of x11perf -aa10text with new kernel bits

For comparison's sake, here is a profile running x11perf -aa10text under compiz with the latest rawhide kernel (which incorporates the new memory mapping kernel bits). I did 3 runs of -aa10text, each of which averaged about 90k glyphs/sec.
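
Averaging several runs by hand is error-prone, so the per-run rates can be pulled out of the x11perf output and averaged automatically. The sketch below assumes that each run is reported on a line containing " reps @ " with the rate in the form "(N/sec)", and that the summary line uses "trep" instead; mean_rate is a hypothetical helper name.

```shell
# Hypothetical helper: read x11perf output on stdin, take the "(N/sec)"
# rate from every per-run line, and print the mean with one decimal.
# Assumes per-run lines contain " reps @ "; the "trep" summary is skipped.
mean_rate() {
    sed -n '/ reps @ /s/.*( *\([0-9.][0-9.]*\)\/sec).*/\1/p' |
        awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}
```

Something like "x11perf -aa10text -repeat 3 2>&1 | mean_rate" would then print a single averaged glyphs/sec figure.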

Comment 16 Ben Gamari 2008-11-09 15:15:05 UTC

Created attachment 20169
Profile of scrolling in firefox with new kernel bits

Here is an attempt at getting a profile for the case of scrolling in firefox. While scrolling through this bug's page, firefox used nearly 20% of a core. Unfortunately, looking at this profile I see no instances of firefox. Regardless, I'm attaching it anyway in the hope that it will help someone.

Comment 17 Michel Dänzer 2008-11-10 02:57:28 UTC

(In reply to comment #14)
> Things are much better now, although I don't think we're entirely free of a
> compositing performance penalty.

FWIW, it's unrealistic to expect that, as compositing does incur at least one additional copy for making updates visible on the screen.

Comment 18 Ben Gamari 2008-11-10 06:35:31 UTC

(In reply to comment #17)
> (In reply to comment #14)
> > Things are much better now, although I don't think we're entirely free of a
> > compositing performance penalty.
> 
> FWIW, it's unrealistic to expect that, as compositing does incur at least one
> additional copy for making updates visible on the screen.
> 

True, but I meant "free" pretty loosely. I was under the impression that this copy (texture mapping) occurs on the 3d unit whereas text rendering was a 2d operation (on dedicated hardware). Is this incorrect? Is the chip bandwidth starved?

I'm getting 150k glyphs/second without compiz in -aa10text. This means that compiz gives a 30% hit in text rendering performance. This is manifested in much smoother scroll performance in firefox when not composited.

Regardless, thanks a ton for your work.

Comment 19 Ben Gamari 2008-11-17 15:12:05 UTC

An update:

With xf86-video-intel from git and a moderately recent kernel (with GEM of course), I now get ~240k glyphs/second with metacity and 205k glyphs/second with compiz. I'll do another set of profiles if needed.

Comment 20 Ben Gamari 2009-04-17 08:35:44 UTC

Just an update:

Today I re-ran aa10text both with metacity and compiz and found the following,

metacity: 235 kglyphs/sec (Woo hoo!)
compiz:   180 kglyphs/sec (Doh!)

Seems like we've regressed a little since November. This is with,
$ xorg-versions.sh 
Xorg components as of Fri Apr 17 11:35:26 EDT 2009
drm: 	1173e7abdcdf758a2403ce921076080c6672c054
xf86-video-intel: 	ebb8d6a13a18138b31ad119be2d3807a1e4010b3
mesa: 	e704de8cb62092f7402cfe99064fcd692e492086
xserver: 	49bd35c28245d2261f17887f03a23deddf57d1e9

Linux mercury.localdomain 2.6.29work #34 SMP PREEMPT Thu Apr 16 12:25:16 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Comment 21 Carl Worth 2009-07-31 10:26:47 UTC

*** Bug 21681 has been marked as a duplicate of this bug. ***

Comment 22 Carl Worth 2009-07-31 10:28:52 UTC

Hi Ben,

Something I'd like to do now is to redo some testing of the overhead imposed by compiz, but with a real-world test case like cairo-perf-trace rather than just "x11perf -aa10text". It occurs to me that we may need to tweak cairo-perf-trace slightly to actually force it to go through paths that involve the compositing manager.

Should be interesting stuff.

-Carl

Comment 23 Carl Worth 2009-08-17 12:48:24 UTC

A week or two ago I did some more investigation of this bug.

I convinced Chris Wilson to make a change so that we can use:

	csi-replay --xlib

to replay a trace to a window, (rather than to just an offscreen
pixmap). With that, I was able to run a firefox trace and measure the
overhead of running with compiz.

I neglected to record the numbers I obtained from that testing,
(though I should be able to get them again easily), but the question
becomes: How small should we expect the overhead to be before we can
consider it adequate?

-Carl

Comment 24 Carl Worth 2009-12-02 08:55:52 UTC

I'm lowering the priority of this bug report since much of the performance
regression when compositing was eliminated long ago.

I'll leave the bug report open in case anybody wants to carefully look at
and characterize the remaining performance difference to decide whether
it's expected or not.

-Carl

Comment 25 zhao jian 2009-12-02 21:02:42 UTC

I tested with aa10text on GM45 32bit, G45 64bit and GM965 64bit platforms and found that there is still some regression.

1. G45b 64bit:
   under X:              302000
   gnome without compiz: 305000
   gnome with compiz:    254000
2. GM45 32bit:
   under X:              256000
   gnome without compiz: 252000
   gnome with compiz:    145000
3. GM965 64bit:
   under X:              250000
   gnome without compiz: 251000
   gnome with compiz:    161000

Comment 26 Chris Wilson 2010-05-31 11:59:23 UTC

To clarify, we do expect a compositing window manager to impact upon performance since it has the role of "fixing" damaged regions on the screen. The effect of enabling compositing is for the app to render into a backing pixmap, for which X then sends damage events to the compositing manager, which then decides how to update the damaged region of the backing pixmap with reference to the composited desktop.

However, what we do not expect is for this to be only a third of the speed of the non-composited case.

Currently, the impact of a composited window manager:
non-composited:        879 kglyphs/s
compiz:                452 kglyphs/s
mutter (gnome-shell):  728 kglyphs/s

In this case, it would seem the residual bug lies within compiz (and I know from recent patches, mutter performance has further improved).

Comment 27 zhao jian 2010-06-01 01:02:46 UTC

I tested on two Pinetrails, one with MeeGo 0.9 (which uses mutter) and the other with Fedora 12 (with compiz). There is about 40% overhead on Pinetrail and 10% on Piketon, which may be acceptable.
Pinetrail with MeeGo:
non-composited:        600000.0/sec
mutter (gnome-shell):  373000.0/sec
Pinetrail with F12:
non-composited:        777000.0/sec
compiz:                490000.0/sec
Piketon:
non-composited:        349000.0/sec
compiz:                302000.0/sec
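
The overhead percentages can be recomputed from the rates above. The sketch below assumes that "overhead" means the fraction of non-composited throughput lost when compositing, i.e. (baseline - composited) / baseline; overhead_table and the row labels are hypothetical names used only for illustration.

```shell
# Hypothetical helper: for each "label baseline composited" row on stdin,
# print the share of baseline throughput lost to compositing.
overhead_table() {
    awk '{ printf "%s %.1f%%\n", $1, ($2 - $3) / $2 * 100 }'
}

# Rates in glyphs/sec, copied from the figures above.
overhead_table <<'EOF'
pinetrail-meego 600000 373000
pinetrail-f12   777000 490000
piketon         349000 302000
EOF
```

Under that definition the three rows come out at 37.8%, 36.9% and 13.5%, which matches the rounded 40% and 10% figures quoted.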

Comment 28 Chris Wilson 2010-06-01 10:55:20 UTC

The 40% overhead case for Meego isn't good enough and is what prompted Robert Bragg to fix it. ;-) So we should see Meego improve, but it is difficult to quantify what the acceptable overhead is.
