Bug 13647 - image_perf.py performance regression
Summary: image_perf.py performance regression
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: 7.3 (2007.09)
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Carl Worth
QA Contact: Xorg Project Team
URL: http://www.awtrey.com/files/xorg-perf...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-12-13 11:29 UTC by Anthony Awtrey
Modified: 2012-07-22 12:47 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
An opreport of Xorg 7.0 while running the image_perf.py (3.88 KB, text/plain)
2007-12-14 11:18 UTC, Anthony Awtrey
An opreport of Xorg 7.2 while running the image_perf.py (5.84 KB, text/plain)
2007-12-14 11:19 UTC, Anthony Awtrey
kernel oops from messages (3.29 KB, text/plain)
2009-05-04 14:46 UTC, Anthony Awtrey
Xorg.log from the XAA run (72.84 KB, text/plain)
2009-05-04 14:47 UTC, Anthony Awtrey
Xorg.log from the EXA run (72.30 KB, text/plain)
2009-05-04 14:47 UTC, Anthony Awtrey
Xorg.log from the UXA run (70.73 KB, text/plain)
2009-05-04 14:47 UTC, Anthony Awtrey

Description Anthony Awtrey 2007-12-13 11:29:43 UTC
Most display functions are faster in Xorg 7.2/7.3 than in Xorg 7.0, but some common image functions are substantially slower. I have linked (see URL) to example code that is written in C++/Qt4 and Python/Gtk2 to demonstrate this issue very clearly when comparing Debian Etch (Xorg 7.0) and the current pre-release of Debian Lenny (Xorg 7.2). Details are included in the tarball.

I reported this issue to the Xorg mailing list (see Subject: Regression Problem in Xorg 7.3 on December 12, 2007) and ran some additional tests suggested by the users there, which directed this report to the Intel driver component. Specifically, running the example code using the VESA drivers or using Xvfb demonstrates that Xorg 7.2 is faster than Xorg 7.0. Running the scripts using the i810/intel driver shows that Xorg 7.2 is much slower than Xorg 7.0 when displaying overlaying images.

I am available to run any other suggested tests and will be watching this bug report for patches or suggestions.

Tony
Comment 1 Michel Dänzer 2007-12-14 09:58:21 UTC
Can you get profiles with something like sysprof or oprofile?
Comment 2 Anthony Awtrey 2007-12-14 11:18:17 UTC
Created attachment 13109
An opreport of Xorg 7.0 while running the image_perf.py
Comment 3 Anthony Awtrey 2007-12-14 11:19:00 UTC
Created attachment 13110
An opreport of Xorg 7.2 while running the image_perf.py
Comment 4 Anthony Awtrey 2007-12-14 13:04:01 UTC
I uploaded two oprofile reports while I was running the image_perf.py Python/Gtk2 example script from my tarball. This script displays an image and turns on and off a transparent overlay 500 times. It runs in about 17 seconds on Etch and about 25 seconds on Lenny (on a 1.2GHz Toughbook CF-18). Let me know if you would like something else run or more information on the issue.
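
For readers without the tarball, the shape of such a measurement can be sketched as below. This is not the original image_perf.py: the function name time_toggles and the show/hide callbacks are placeholders invented for illustration, and a real GTK test would also have to flush pending events after each call so that the drawing actually reaches the X server.

```python
import time


def time_toggles(show, hide, count=500):
    """Return the wall-clock seconds needed to call show() then hide() `count` times.

    `show` and `hide` are hypothetical callbacks; in a toolkit test they would
    make the overlay visible or hidden and wait for the redraw to complete.
    """
    start = time.perf_counter()
    for _ in range(count):
        show()
        hide()
    return time.perf_counter() - start


# Stand-in callbacks that only record the call order (no X server involved).
calls = []
elapsed = time_toggles(lambda: calls.append("show"),
                       lambda: calls.append("hide"),
                       count=500)
print("Time to flip the overlay %d times: %f" % (len(calls) // 2, elapsed))
```

With the stand-in callbacks this only measures Python call overhead; the point is the structure of the loop, not the numbers.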
Comment 5 Eric Anholt 2007-12-15 13:44:56 UTC
You should use EXA if you're limited by Render performance.  It's on by default in the 2.2 driver.
Comment 6 Anthony Awtrey 2007-12-17 06:08:52 UTC
I *am* running with EXA...

The image_perf.py benchmark running with "AccelMethod" "EXA":

  ideal@dhcp-141:~/Xorg_Performance/pygtk$ ./image_perf.py
  Time to flip the blue overlay 500 times: 25.657996

The image_perf.py benchmark running with "AccelMethod" "XAA":

  ideal@dhcp-141:~/Xorg_Performance/pygtk$ ./image_perf.py
  Time to flip the blue overlay 500 times: 81.217041   <--- LOOK

I know it is a total inconvenience to download my tarball (see URL above) and run my benchmark Python/Gtk2 utilities, but I swear you'll see exactly what I'm talking about. Using Gtk.Image to place the images and then calling the show_now() and hide() functions is uselessly slow. Ditto using Qt4's QPixmap object and toggling the setVisible() function. These are probably the most common methods to display images on the two most common libraries to interface with X Window in Linux... and the performance is nearly twice as bad now!

This performance regression is absolutely killing us because we can't run the newest Intel hardware on Xorg-7.0 with acceleration working, and because of this bug the performance on Xorg 7.2+ is so poor that it makes our application virtually unusable.
Comment 7 Gordon Jin 2007-12-18 17:47:56 UTC
So this regression is caused by XAA->EXA, instead of Xorg 7.0->7.2/7.3?
Comment 8 Anthony Awtrey 2007-12-19 05:57:26 UTC
Gordon, if you are asking me, I don't know enough about X internals to make that call. I can only point out what I'm seeing when trying to run an application that worked well in Xorg-7.0 and now doesn't in Xorg-7.2 *regardless* of which acceleration method is used.
Comment 9 Gordon Jin 2007-12-19 23:28:22 UTC
So, by running image_perf.py,
Xorg 7.0 (with XAA): 17.24
Xorg 7.3 (with XAA): 81.22
Xorg 7.3 (with EXA): 25.65

It's not caused by using EXA in intel 2.2.0 release.
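
As a rough cross-check of those three figures, the slowdown relative to the Xorg 7.0 run can be computed directly. The sketch below uses only the timings quoted in this comment; the helper name slowdown is made up for illustration. It works out to roughly 1.5 times slower with EXA and roughly 4.7 times slower with XAA on Xorg 7.3.

```python
# Timings in seconds, copied from the comment above.
timings = {
    "Xorg 7.0 (with XAA)": 17.24,
    "Xorg 7.3 (with XAA)": 81.22,
    "Xorg 7.3 (with EXA)": 25.65,
}
baseline = timings["Xorg 7.0 (with XAA)"]


def slowdown(seconds, reference=baseline):
    """Return how many times slower `seconds` is than the reference run."""
    return seconds / reference


for name, seconds in timings.items():
    print("%-20s %6.2f s  %.2fx" % (name, seconds, slowdown(seconds)))
```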
Comment 10 Gordon Jin 2008-08-04 19:51:31 UTC
I guess I can take off this bug, with cworth joining.
Comment 11 Carl Worth 2008-08-07 17:13:01 UTC
Thanks Gordon,

For performance of the Intel driver, we'll be looking to Keith Packard's recent "UXA" work (which is EXA but with the problematic migration code removed). That will be landing shortly, after which I'll reevaluate this bug.

Obviously, there was a non-EXA-specific performance regression as well, here, but optimizing the XAA experience is just not interesting at this point.

And thanks, Anthony, for the test case. This should be quite useful as we do further tuning.

-Carl
Comment 12 Anthony Awtrey 2009-05-04 14:46:25 UTC
Hello again a year and a half later.

I am still seeing this regression issue in the latest Xorg. Here are my versions:

Hardware

CPU: Intel(R) Pentium(R) M processor 1.20GHz
Graphics: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 04)

Software

Base OS: Debian Lenny (gcc 4.3.2)
Xorg: 7.4/1.6.0 built from source
Intel: 2.6.3 built from source
Kernel: 2.6.29.1 built from kernel.org source

NOTE: The regression exists on all intel graphics hardware tested up to GM45.

So I ran my image_perf.py to benchmark the image display regression again:

XAA:
debian:~/Xorg_Performance/pygtk$ DISPLAY=:0.0 ./image_perf.py
Time to flip the blue overlay 500 times: 53.669110

EXA:
debian:~/Xorg_Performance/pygtk$ DISPLAY=:0.0 ./image_perf.py
Time to flip the blue overlay 500 times: 97.717825

UXA:
debian:~/Xorg_Performance/pygtk$ DISPLAY=:0.0 ./image_perf.py
Time to flip the blue overlay 500 times: 86.004156

Wow you actually managed to speed up XAA! Too bad I hear you guys are planning on dumping out both XAA and EXA... that really fills me with dread and despair. 

During the 30 minutes I spent testing this again I still noted the font display anomaly that kept us from using EXA (outside of the performance problems). Namely, EXA doesn't display the TrueType font correctly while XAA and UXA modes both do it just fine with the same code. (This isn't part of my regression case, just something I noticed while testing.)

Also, the UXA mode crashed (as in kernel oops) two different times when I started X after changing the AccelMethod from XAA or EXA. The UXA method seems to work fine after a clean reboot, but $deity help you if you run another method first and gunk up the kernel module.

Now, I know this is older Intel hardware, but in most cases I would think that this would also make it better understood hardware. I guess I'm wrong. I see the same things on the box with the GM45 chipset. I just wanted my performance numbers to be on the same box from the original bug report 1.5 years ago.

I don't really hold out hope any more of you guys ever fixing this (or the EXA font display issue that's been there for 1.5 years as well, or the UXA kernel oops I'm seeing today for the first time)... but I'll still include my Xorg.log files and a kernel oops for giggles.
Comment 13 Anthony Awtrey 2009-05-04 14:46:54 UTC
Created attachment 25432
kernel oops from messages
Comment 14 Anthony Awtrey 2009-05-04 14:47:13 UTC
Created attachment 25434
Xorg.log from the XAA run
Comment 15 Anthony Awtrey 2009-05-04 14:47:25 UTC
Created attachment 25435
Xorg.log from the EXA run
Comment 16 Anthony Awtrey 2009-05-04 14:47:36 UTC
Created attachment 25437
Xorg.log from the UXA run
Comment 17 Eric Anholt 2009-07-13 12:16:25 UTC
Tested on master:
Time with vesa: 7.750472
Time with UXA: 6.438378

With 2.6.3, UXA+KMS would do dri_bo_map on the screen to do the software fallback of that image (sadly, the client is using SHM pixmaps, which are very harmful if your intention is to do repeated draws of the same image).  With the current code, UXA+KMS does drm_intel_bo_map_gtt, which speeds up the software fallback significantly.
Comment 18 Anthony Awtrey 2009-07-13 13:01:18 UTC
Thanks for looking into this Eric. The problem that led to this bug isn't simply displaying the same image over and over.

The application where we first saw this issue is a touchscreen application that doesn't use standard GUI widgets. Because the operator may be using gloves, the buttons and menus must be sized appropriately. To accomplish this the application utilizes large-ish images to provide buttons and menus in a way similar to the examples included in the test case zip file.

In the application, the user is pressing a button image on a screen and expecting a menu to appear.  When this application was first developed three years ago, the Xorg version in Debian Etch was very responsive to displaying overlaying images like this.

To support newer hardware, it was necessary to use newer and newer kernel / Xorg versions. At one point the delay to overlay a menu was so significant that the user was left thinking the menu was not activated and pressed the button again and again. A quarter second or half-second delay may seem insignificant, but it can *really* impact the user experience.
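
To put the benchmark totals from earlier in this report into per-interaction terms, they can be divided by the 500 flips. This is only a sketch under a stated assumption: the report does not say whether one "flip" is a single show or hide, or a full show-and-hide pair, so the figures are averages per counted flip, and the helper name per_flip_ms is made up for illustration.

```python
FLIPS = 500  # iteration count printed by image_perf.py

# Total run times in seconds, copied from earlier comments in this report.
totals_s = {
    "Xorg 7.0, XAA (2007)": 17.24,
    "XAA (2009)": 53.669110,
    "EXA (2009)": 97.717825,
    "UXA (2009)": 86.004156,
}


def per_flip_ms(total_seconds, flips=FLIPS):
    """Average time per flip in milliseconds, assuming the total splits evenly."""
    return total_seconds / flips * 1000.0


for name, total in totals_s.items():
    print("%-22s %7.1f ms per flip" % (name, per_flip_ms(total)))
```

On these numbers the original setup averaged a few tens of milliseconds per flip, while the 2009 EXA and UXA runs average close to a fifth of a second.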

This issue affected the perception of our software and made users complain that newer versions "felt slow" compared to older versions even though we could demonstrate quantitatively that the actual performance of the application was faster. This has been a sore point with the customer for over a year now and has been brought up over and over at meetings.

I am glad the functions are now faster and look forward to trying out the latest version. Hopefully performance will continue to improve from now on.

I would request that at least some attention be paid to support the older Intel chips (i810/i915) better. The performance is so bad now with UXA on those platforms as to be truly useless at this point. Running a Python/GTK application I can actually watch GTK buttons and widgets draw in on the screen.
Comment 19 Eric Anholt 2009-07-13 15:08:42 UTC
We finally have a decent desktop benchmarking tool (cairo-perf-trace), and we've had major performance wins thanks to finally being able to quantify outside of microbenchmarks.  I've got a 10% win in firefox queued to land post 2.8 that will affect some general GTK rendering as well.

Sorry for the rough times -- a lot of it has been due to rewriting the whole stack, and I think we're set up to continue seeing performance wins at this point as things have settled down.  However, we still need someone to work on the GTK side to fix it to not use SHM pixmaps for reused images.
Comment 20 Chris Wilson 2012-07-21 13:45:53 UTC
Anthony, the benchmarking tool you pointed to earlier has gone. I would very much like to verify that we perform reasonably for your use case. Hopefully better late than never!
Comment 21 Anthony Awtrey 2012-07-22 01:23:11 UTC
Wow! Thanks for the follow-up, no matter how late it is. I actually found the old tarball (God, I'm such a packrat...) and stuck it back in the previous link location.

I was very happy to see that at least the pygtk code still just works (under current Debian Unstable). The qt4 code may need some tweaks, but I can't compile/run on my box right now. When I get a sec, I'll grab those old platforms and run apples-to-apples numbers on the current software load (based on Squeeze currently with a planned upgrade to Wheezy late this year).

Thanks again!
Comment 22 Chris Wilson 2012-07-22 12:17:18 UTC
This is what I currently see on t61, the most recent machine supported by UMS/XAA I have. (It's a bad choice of machine for a variety of other reasons though :-p)

crestline (965gm) image_perf.py

                      ./pixmap_perf.py  ./image_perf.py
etch (baseline 915):         25.002718        17.241337
lenny (baseline 915):        12.012802        25.642715
xaa:                          0.513530         6.157524
exa:                          0.443099        15.865394
uxa:                          0.433677        12.950995
sna:                          0.397831         8.594444

xaa/exa on xorg-1.5 with -intel-2.6
uxa/sna on xorg-1.12 with -intel-2.20

Judging by that I still have a small but still significant regression from the UMS/XAA heyday. Can you share any recent results from your baseline machine?
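
The size of that regression can be read off by normalising the t61 rows against the xaa row. A small sketch using only the figures in the table above (the helper name relative_to is made up for illustration): on image_perf.py, sna comes out roughly 1.4 times slower than xaa, with uxa at about 2.1 times and exa at about 2.6 times, while on pixmap_perf.py all three are faster than xaa.

```python
# (pixmap_perf.py, image_perf.py) times in seconds on the t61, from the table above.
results = {
    "xaa": (0.513530, 6.157524),
    "exa": (0.443099, 15.865394),
    "uxa": (0.433677, 12.950995),
    "sna": (0.397831, 8.594444),
}


def relative_to(table, reference):
    """Return each row divided column-wise by the reference row (>1 means slower)."""
    ref = table[reference]
    return {
        name: tuple(value / base for value, base in zip(row, ref))
        for name, row in table.items()
    }


for name, (pixmap_ratio, image_ratio) in relative_to(results, "xaa").items():
    print("%-4s pixmap_perf %.2fx  image_perf %.2fx" % (name, pixmap_ratio, image_ratio))
```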
Comment 23 Chris Wilson 2012-07-22 12:47:21 UTC
Ok, the result for sna is actually bistable depending upon the order of execution (migration heuristics at play); it oscillates between 6s and 8s for image_perf.py. So the potential to perform as well as xaa is hidden in there...

