Bug 12216 - Mesa V7.0.1 on x86_64 crashes while rendering a NURBS surface
Summary: Mesa V7.0.1 on x86_64 crashes while rendering a NURBS surface
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core
Version: unspecified
Hardware: x86-64 (AMD64) other
Importance: medium normal
Assignee: mesa-dev
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-29 15:10 UTC by George Alverson
Modified: 2016-02-04 11:21 UTC
CC List: 0 users

See Also:
i915 platform:
i915 features:


Attachments
stack trace (3.26 KB, text/plain)
2007-08-29 15:12 UTC, George Alverson
Test object containing four NURBS surfaces (13.58 KB, text/plain)
2007-08-29 15:14 UTC, George Alverson
View of test object properly rendered (9.27 KB, image/png)
2007-08-29 15:17 UTC, George Alverson

Description George Alverson 2007-08-29 15:10:23 UTC
Crash dump: to be appended.

Description of surface: to be appended.

Problem: segmentation fault in p4_general_loop () at x86-64/xform4.S:67.

No problem: Mesa V7.0.1 on x86_32.

System: Modified RH: Linux d0rel7.fnal.gov 2.6.9-22.0.2.ELsmp #1 SMP Wed Jan 18 14:39:15 CST 2006 x86_64 x86_64 x86_64 GNU/Linux

Compiler: gcc 3.4.4
Comment 1 George Alverson 2007-08-29 15:12:42 UTC
Created attachment 11324 [details]
stack trace
Comment 2 George Alverson 2007-08-29 15:14:46 UTC
Created attachment 11325 [details]
Test object containing four NURBS surfaces
Comment 3 George Alverson 2007-08-29 15:17:36 UTC
Created attachment 11326 [details]
View of test object properly rendered
Comment 4 Roland Scheidegger 2007-08-30 06:04:47 UTC
Looks to me like the src address may simply not be 128-bit aligned, which movaps requires. In that case, using movups instead should fix it. Could you check whether that works? (You might hit the same problem elsewhere; there's a lot of similar code around, so I wouldn't be surprised if it's wrong in other places too.) A better solution would be to change the code so it guarantees the data is 128-bit aligned, if possible (using aligned mallocs, though I'm not sure right now whether there is any compiler-independent way to do that when the variable is on the stack).
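
For illustration, the single-instruction change meant here (a sketch in AT&T syntax as used in Mesa's .S files; the register is made up):

movaps (%rdx), %xmm0   /* aligned load: faults unless the address is 16-byte aligned */

becomes

movups (%rdx), %xmm0   /* unaligned load: same semantics, works for any address */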
Comment 5 George Alverson 2007-08-30 06:48:20 UTC
Changing from movaps to movups in xform4 cured the crash. There is still a problem correctly displaying the surface (it disappears once it is rotated), but that'll be a different bug.

Should I mark this as FIXED or wait until the repository is patched?

-Thanks.
Comment 6 Brian Paul 2007-08-30 07:43:42 UTC
I've checked in the movups change.

If you can provide a complete program to reproduce the rendering problem, I'll look into it.
Comment 7 Roland Scheidegger 2007-08-30 09:09:07 UTC
Someone more familiar with x86 assembly should probably take another look; there could be more issues like this one. Also, I think the manual padding is evil (can't we just ditch it completely, at least the instances not in loops?), and it gets especially strange when the comment says "/* manual align += 3 */" but the padding adds 4 bytes...
Comment 8 Patrick Baggett 2007-08-30 12:52:01 UTC
(In reply to comment #4)
> Looks to me like the src address may simply not be 128-bit aligned, which
> movaps requires. In that case, using movups instead should fix it. Could you
> check whether that works? (You might hit the same problem elsewhere; there's
> a lot of similar code around, so I wouldn't be surprised if it's wrong in
> other places too.) A better solution would be to change the code so it
> guarantees the data is 128-bit aligned, if possible (using aligned mallocs,
> though I'm not sure right now whether there is any compiler-independent way
> to do that when the variable is on the stack).


Manually aligning a stack variable in a compiler-independent way is fairly simple, but like an aligned malloc() implementation it takes slightly more memory and setup than a compiler extension (e.g. GCC's __attribute__) would. The basic idea: a 128-bit value needs 16 bytes of stack space, and in the worst case (address mod 16 == 1) up to 15 extra bytes of padding. To keep it round, I'd just use 32 bytes. To find the aligned address:

char data[32];
uintptr_t addr;     /* pointer-sized integer; safer than assuming long == pointer */
float* aligned_ptr;

addr = (uintptr_t)&data[0];

addr = (addr + 15) & ~(uintptr_t)0x0F; /* round up to the next 16-byte boundary */

aligned_ptr = (float*)addr; /* now points at a 16-byte-aligned 128-bit slot */
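
For comparison, the compiler-specific route mentioned above would be, under GCC (v is just an example name):

float v[4] __attribute__((aligned(16))); /* GCC extension: 16-byte-aligned stack array */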


Replacing movaps with movups carries a severe performance hit in heavy memory-traffic areas, and by forcing the lowest common denominator you also give up x86 memory operands as a way to relieve register pressure, since SSE/SSE2/SSE3 instructions with memory operands must be aligned just as movaps must. I'm not suggesting every instance be replaced; getting the data aligned in the first place would be the optimal solution. In fact, I don't know of a single machine architecture that dislikes aligned data, but I can certainly think of a few that choke on unaligned data. It would probably be better to place a debug assert() checking the addresses for alignment, or better yet, write a general-case version that does the check inside the assembly itself, e.g.:

test rax, 15        ; pointer in rax: are any of the low 4 bits set?
jz aligned_loop     ; all clear -> 16-byte aligned

unaligned_loop:
[...movups code...]
ret
aligned_loop:
[...movaps code...]
ret
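
On the C side, the debug assert mentioned above could be as simple as this sketch (ptr stands for whatever vector pointer is passed in):

#include <assert.h>
#include <stdint.h>

assert(((uintptr_t)ptr & 0x0F) == 0); /* trap in debug builds if not 16-byte aligned */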


I'm pretty familiar with x86 assembly language, especially the SIMD instruction sets; perhaps I could help a bit?

Patrick Baggett


Comment 9 Roland Scheidegger 2007-08-30 16:23:35 UTC
(In reply to comment #8)
> > it guarantees the data is 128-bit aligned, if possible (using aligned
> > mallocs, though I'm not sure right now whether there is any
> > compiler-independent way to do that when the variable is on the stack).
> 
> 
> Manually aligning a stack variable in a compiler-independent way is fairly
> simple, but like an aligned malloc() implementation it takes slightly more
> memory and setup than a compiler extension (e.g. GCC's __attribute__) would.
> The basic idea: a 128-bit value needs 16 bytes of stack space, and in the
> worst case (address mod 16 == 1) up to 15 extra bytes of padding. To keep it
> round, I'd just use 32 bytes. To find the aligned address:
> 
> char data[32];
> uintptr_t addr;     /* pointer-sized integer; safer than assuming long == pointer */
> float* aligned_ptr;
> 
> addr = (uintptr_t)&data[0];
> 
> addr = (addr + 15) & ~(uintptr_t)0x0F; /* round up to the next 16-byte boundary */
> 
> aligned_ptr = (float*)addr; /* now points at a 16-byte-aligned 128-bit slot */
Well, what I really meant was doing it in a compiler-independent way but without any additional effort. I'll take that as a "no" ;-).

> Replacing movaps with movups carries a severe performance hit in heavy
> memory-traffic areas, and by forcing the lowest common denominator you also
> give up x86 memory operands as a way to relieve register pressure, since
> SSE/SSE2/SSE3 instructions with memory operands must be aligned just as
> movaps must. I'm not suggesting every instance be replaced; getting the data
> aligned in the first place would be the optimal solution.
Register pressure shouldn't be too bad: the code in question is used on x86_64 only (the x86 version uses a 32-bit mov there), so that's no problem. And at least the destination is always aligned.

> I'm pretty familiar with x86 assembly language, especially the SIMD
> instruction sets; perhaps I could help a bit?
> 
> Patrick Baggett
If you want to give it a look, that would be great. There isn't actually that much SSE assembly; more functions are optimized "only" for x86 without SSE, since it wasn't particularly easy to parallelize (IIRC at least partly due to the limited shuffle capabilities). And I think the x86_64-optimized version still uses some 3DNow! opcodes, which, while maybe more suitable to the code, certainly pose problems on Intel chips...
Comment 10 Patrick Baggett 2007-09-06 18:04:04 UTC
I have been looking at x86-64/xform.S

The _mesa_x86_64_transform_points4_general code segment is pretty shaky in a couple of places:
1) 3DNow! prefetch instructions are used even though SSE includes analogous instructions and every x86-64 processor supports SSE (see the sketch at the end of this comment).
2) The stride loaded into the %eax register (line 41) doesn't have to be 16, so even if the compiler aligned the pointer, the 'movaps' instruction can only be used when the stride's low 4 bits are zero (i.e. the stride is a multiple of 16).
3) 0xf3 before ret (??)
4) no-op padding (has anyone tested if this actually helps performance?)
5) Aligned moves used on the matrix data.

Is the stride always 16 (4x32) for V4F vector values?
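
For item 1, a rough sketch of the substitution (AT&T syntax as in Mesa's .S files; the offset is made up):

prefetch   64(%rdx)    /* 3DNow! prefetch: problematic on Intel CPUs */

would become

prefetcht0 64(%rdx)    /* SSE prefetch: available on every x86-64 CPU */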
Comment 11 Roland Scheidegger 2007-09-08 04:38:22 UTC
(In reply to comment #10)
> I have been looking at x86-64/xform.S
> 
> _mesa_x86_64_transform_points4_general code segment is pretty shaky in a couple
> of places. 
> 1) 3DNow! prefetch instructions are used even though SSE includes analogous
> instructions and every x86-64 processor supports SSE.
Yes, those should be replaced. But other functions in the same file use other 3DNow! instructions, so replacing only the prefetches doesn't really help. (And shouldn't you actually prefetch further ahead, at least by a cache line, for optimal performance?)

> 2) The stride loaded into the %eax register (line 41) doesn't have to be 16,
> so even if the compiler aligned the pointer, the 'movaps' instruction can
> only be used when the stride's low 4 bits are zero (i.e. the stride is a
> multiple of 16).
Ok, I guess we need the movups then. Depending on where the stride comes from, it may well be a multiple of 16, but I haven't looked at it.
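
A runtime check could keep movaps in the common case; a rough sketch in AT&T syntax, with a hypothetical label:

testl $15, %eax            /* is the stride a multiple of 16? */
jnz   .Lgeneral_unaligned  /* no: fall back to the movups path */
/* ...movaps fast path... */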

> 3) 0xf3 before ret (??)
No idea there... That's just another case of no-op padding, but why is it needed?

> 4) no-op padding (has anyone tested if this actually helps performance?)
Possibly the original author, but I'm sure no one else has.

> 5) Aligned moves used on the matrix data.
> 
> Is the stride always 16 (4x32) for V4F vector values?
If this is a Mesa matrix, then that should be true: the matrix constructor (_math_matrix_ctr) uses 16-byte alignment.
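
From memory, the relevant allocation in math/m_matrix.c is roughly (don't quote me on the exact macro):

m->m = (GLfloat *) ALIGN_MALLOC( 16 * sizeof(GLfloat), 16 );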

As I said, patches welcome :-).
Comment 12 Timothy Arceri 2016-02-04 11:21:17 UTC
The original problem was fixed in 2007, so I'm closing the bug.

Another issue pointed out here was finally fixed today \0/ (see bug 27512):

> 1) 3DNow! prefetch instructions are used even though SSE includes analogous instructions and every x86-64 processor supports SSE.

