2727 – performance anomalies with spantmp2 / readpix

Summary: performance anomalies with spantmp2 / readpix

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Mesa core
Version:	git
Hardware:	x86 (IA32) Linux (All)

Importance:	high minor
Assignee:	mesa-dev

Reported:	2005-03-14 13:37 UTC by Roland Scheidegger
Modified:	2011-07-22 10:08 UTC
CC List:	0 users

Description Roland Scheidegger 2005-03-14 13:37:47 UTC

Not a bug per se, but there are multiple performance anomalies when using the
optimized read span functions from the spantmp2 template.

1) Performance is much lower on a Radeon 9000 Pro, AGP 4x, Athlon 64 3200+ (run
in 32-bit mode), VIA K8T800 than on a Radeon 7200 SDR, AGP 2x (albeit overclocked
to 89 MHz), Celeron Tualatin (1.33 GHz), i440BX. On the A64 system I
consistently get 2.8 MPix/s regardless of whether the SSE2, MMX or generic path
is used, whereas on the good old Celeron Tualatin I get over 8 MPix/s. This could
be due to the way AGP is handled on the A64 (with its IOMMU), but I'd at least
like a better explanation of what's going on (even if it's not "fixable").

2) On the Tualatin, the SSE path is consistently slower than the MMX one. The
difference is not drastic, but it is measurable (4% or so).

3) Moving the readpix window by just a pixel can have a DRASTIC impact on
performance. If the window stays in the same place, the score is always (more or
less) the same, but move it slightly and the score changes radically (and I've
disabled color tiling just to be really sure it's not due to that). When X is
set to 24-bit, the difference is only around 10%, but with 16-bit color I got
results ranging from 4 MPix/s to 15 MPix/s (MMX path only, there is no SSE
path; the generic path was dead slow but consistent at around 2 MPix/s). And
it's not really a "range" of results, but rather 3 sets of results: depending on
where the window is, I got a value around 4.3 MPix/s, around 9.5 MPix/s or
14.5 MPix/s (all +- 0.5 MPix/s), but never a value in between. Are there
alignment issues?
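
One way to explore the alignment question is to compute how well-aligned the
first pixel of a read span is for a given window x position. The sketch below is
not Mesa code: span_alignment() is a hypothetical helper, and it assumes each
scanline starts on an 8-byte boundary, which the report does not state. At
2 bytes per pixel it yields three classes (8, 4 and 2 bytes) as x varies, and at
4 bytes per pixel only two (8 and 4). That would be consistent with three
clusters of results at 16-bit and a smaller spread at 24-bit, but it is only a
hypothesis, not a measurement.

```c
#include <stdint.h>

/* Hypothetical helper (not part of Mesa): returns the largest power of two,
 * capped at `cap`, that divides the byte offset of pixel column x within a
 * scanline. Assumes the scanline itself starts on a `cap`-byte boundary. */
unsigned span_alignment(unsigned x, unsigned bytes_per_pixel, unsigned cap)
{
    uintptr_t offset = (uintptr_t)x * bytes_per_pixel;
    unsigned align = 1;

    /* Double the candidate alignment while the offset is still a multiple. */
    while (align < cap && (offset % (align * 2)) == 0)
        align *= 2;
    return align;
}
```

Calling span_alignment(x, 2, 8) for x = 0, 1, 2, 3 gives 8, 2, 4, 2; logging this
value next to each benchmark score would show whether the clusters track
alignment.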

4) The optimized generic x86 path is very slow on the Tualatin. I've done some
measurements where I disabled this optimization and it was over 3 times faster.
sse: 8.24 MPix/s
mmx: 8.50 MPix/s
generic: 1.49 MPix/s
without optimized asm: 4.79 MPix/s
The instructions used (bswap and rorl) didn't look suspicious; I replaced them
with a single nop and it did not get any faster. So the reason it's slower is
apparently simply the use of inline assembly. I'm not sure whether the compiler
needs to insert some code when it encounters the _asm directive. Otherwise it's
probably because the presence of _asm prevents the compiler from doing some
optimizations it could otherwise do (loop unrolling?). It's probably not much of
a problem, since almost all x86 machines we care about have at least MMX...
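
For reference, the per-pixel operation can be written without inline assembly.
The sketch below assumes that "swapb" in the report means the x86 bswap
instruction and that the asm sequence is bswap followed by rorl $8; the function
names are made up for illustration and this is not the spantmp2 source. The
first function spells out what that sequence computes, the second is an
equivalent mask-and-shift form that leaves the compiler free to schedule and
unroll the surrounding loop.

```c
#include <stdint.h>

/* What "bswap; rorl $8" computes, in plain C: reverse the four bytes,
 * then rotate the result right by 8 bits. */
uint32_t swizzle_bswap_ror(uint32_t p)
{
    uint32_t swapped = (p >> 24) | ((p >> 8) & 0x0000ff00u) |
                       ((p << 8) & 0x00ff0000u) | (p << 24);
    return (swapped >> 8) | (swapped << 24);
}

/* Equivalent mask-and-shift form: keep bytes 3 and 1 in place and
 * exchange bytes 2 and 0. */
uint32_t swizzle_portable(uint32_t p)
{
    return (p & 0xff00ff00u) | ((p >> 16) & 0xffu) | ((p & 0xffu) << 16);
}
```

Both functions map 0x11223344 to 0x11443322. Whether the plain C form accounts
for the "without optimized asm" figure above is not something this sketch can
confirm.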

Comment 1 Adam Jackson 2009-08-24 12:23:06 UTC

Mass version move, cvs -> git

Comment 2 Ian Romanick 2011-07-22 10:08:55 UTC

Since nothing has been done on this bug in six years, I'm sure we'll get to it real soon now. :)
