I've observed a regression in Mesa 17 on s390x (zSystems). It's hard to describe, so here is a video on YouTube: https://youtu.be/xaYauE0Kn8E

On the left side glxgears is running on x86_64, on the right side it's running on s390x (zSystems), a 64-bit big-endian platform. gdm and gnome-shell show a black screen. I haven't screenshotted that though (since it's boring).

I git bisected Mesa and the culprit patch was:

From e827d9175675aaa6cfc0b981e2a80685fb7b3a74 Mon Sep 17 00:00:00 2001
From: Roland Scheidegger <sroland@vmware.com>
Date: Wed, 21 Dec 2016 04:43:07 +0100
Subject: [PATCH] draw: use SoA fetch, not AoS one

Now that there's some SoA fetch which never falls back, we should always get results which are better or at least not worse (something like rgba32f will stay the same).

When reverse-applying this patch on Mesa 17.0.3 the regression on s390x gets fixed.

Mesa has been compiled with:

./configure --host=s390x-ibm-linux-gnu --build=s390x-ibm-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --enable-gles1 --enable-gles2 --enable-dri --with-egl-platforms=x11,drm --enable-shared-glapi --enable-texture-float --enable-osmesa --enable-dri3 --enable-shader-cache --enable-gbm --enable-glx-tls --with-dri-searchpath=/usr/lib64/dri --enable-gallium-llvm --enable-llvm-shared-libs --enable-vdpau --enable-va --enable-xvmc --with-dri-drivers=swrast --with-gallium-drivers=swrast 'CFLAGS=-fmessage-length=0 -grecord-gcc-switches -fstack-protector -O2 -Wall -D_FORTIFY_SOURCE=2 -funwind-tables -fasynchronous-unwind-tables -g -DNDEBUG'
I can't say I'm terribly surprised this commit broke it. That it might not work on BE certainly came to my mind when I did this, despite my efforts... I suspect the affected format didn't work with SoA fetch before, but this wasn't much of an issue because it was commonly only used with vertex fetch, which used the AoS fetch. Albeit just before this commit, I did some changes to SoA fetch which might also potentially break BE (but you'd only see it with texture format tests and such before this commit). It is also possible the vector_justify parameter for lp_build_gather() isn't quite set correctly somewhere.

Fixing llvmpipe on big-endian archs was done by Richard Sandiford <r.sandiford@uk.ibm.com> a couple of years ago (introducing that vector_justify), and it worked as long as no one touched the BE-affecting code, I suppose... I don't think there are many developers with BE boxes, unfortunately (well, there aren't many BE boxes in the first place nowadays...). Someone would have to look into it - glxgears should only be using the R32G32B32F vertex format; with some luck it's a special case (due to it being a native 3-channel, npot format) which can be fixed rather easily, but there might be more fundamental issues with SoA fetch on BE.
Roland, thanks a lot for your prompt reply! Very much appreciated!

Seems Richard meanwhile switched companies from IBM to ARM. I found him on LinkedIn. Possibly he's now working on aarch64 (LE), so I'm afraid he no longer has access to BE machines.

Unfortunately I'm not familiar with llvmpipe at all. Would it be an option not to change the code there for BE, if developers have no access to such machines? Reverse-applying the commit is going to break sooner or later, I'm sure.

Of course I'm willing to test any proposed change/patch on s390x, but I'm not a Mesa/llvmpipe developer per se.

Unfortunately llvmpipe is needed on s390x, since it has become a requirement for modern desktops like gdm/gnome-shell. :-(

I can't say how fundamental the issue is. gdm and gnome-shell just show a black screen. :-( I found glxgears more useful as an example. ;-)
(In reply to Stefan Dirsch from comment #2)
> Roland, thanks a lot for your prompt reply! Very much appreciated!
>
> Seems Richard meanwhile switched companies from IBM to ARM. I found him on
> LinkedIn. Possibly he's now working on aarch64 (LE), so I'm afraid he no
> longer has access to BE machines.
>
> Unfortunately I'm not familiar with llvmpipe at all. Would it be an option
> not to change the code there for BE, if developers have no access to such
> machines? Reverse-applying the commit is going to break sooner or later,
> I'm sure.

That would be theoretically possible, but I can't say I particularly like that solution. It doesn't make much sense that the fetch paths for BE and LE are completely distinct... Chances are it will break sooner or later anyway - this code really desperately wants someone who is willing to test it and keep it working on BE. (That it took 3 months until someone noticed it's broken isn't a good sign...) Otherwise there's probably a build change down the road which just disables the build on BE...

> Of course I'm willing to test any proposed change/patch on s390x, but I'm
> not a Mesa/llvmpipe developer per se.
>
> Unfortunately llvmpipe is needed on s390x, since it has become a requirement
> for modern desktops like gdm/gnome-shell. :-(

All the more reason why someone might want to look into it...

> I can't say how fundamental the issue is. gdm and gnome-shell just show a
> black screen. :-(

I don't know what vertex formats these use, but yes, bogus vertex fetch will make for a very bad experience (it's nearly a miracle glxgears still manages to draw something - in fact I like that new look better :-)).

I've taken a closer look now, and I can see some reasons why it doesn't work. That said, I never really understood the vector_justify logic, which just looks odd to me. But in the end the gather really is different for AoS and SoA (and I didn't understand the differences there either wrt vector_justify).
So, looking at the R32G32B32F format (which glxgears uses) - for this format SoA vs. AoS should not actually make that much of a difference, since it doesn't require any actual conversion:

The old code would have called lp_build_fetch_rgba_aos() 4 times, which would have resulted in 4 lp_build_gather() calls with vector_justify set to TRUE, block_bits 96 and a dst type of 1x128bit. The gather would have fetched 96 bits, done a ZEXT, and then (due to vector_justify - this is the stuff guarded with PIPE_ARCH_BIG_ENDIAN in lp_bld_gather.c) done a left shift of 32, for some reason I don't quite get (I thought it shouldn't make a difference with those array formats whether they are fetched on BE or LE, but it looks like I'm wrong). The values then would have gone through lp_build_format_swizzle_aos() (and I have no idea if that swizzle looks different on BE) before finally getting transposed to SoA.

The new code will now use one lp_build_fetch_rgba_soa() call. This will still end up with 4 gathers, but in the SoA path, which always uses a vector_justify of false (why? I have no idea, but this was like that before), so you don't get the left shift of 32. Oh, and the values will be fetched as 3x32bit instead of a 96bit int, if that makes any difference (again, I have no idea really). (This particular change was one of the changes preceding this commit, so you could verify independently if it breaks stuff; some piglit texture format tests for instance could show that. Unfortunately lp_test_format only does (scalar) rgba_aos fetch, so not exactly helpful for that, but you really want rgba SoA fetches working in general, regardless of vertex fetch.) (It will do pad_vector, so use a shuffle to extend the 3x32bit values to 4x32bit instead of using a ZExt to 1x128bit, but I'm not worried about that particular bit.) The values will then be transposed and finally go into lp_build_format_swizzle_soa().
So, my guess is maybe things would work a bit better if you hacked up the vector_justify parameter to lp_build_gather() in lp_build_fetch_rgba_soa(). However, this almost certainly breaks all the other callers of lp_build_fetch_rgba_soa(), which is used for just about all texture formats except the rgba8 ones - so glxgears and desktop compositors might still run, but probably not much else. You don't want to do that...
Created attachment 130977 [details] [review]
patch to aid in troubleshooting

The attachment above fixes glxgears, gnome-shell, and a bunch of piglit tests (though piglit is still running, so I haven't had a chance to see how it compares to the state before commit 1e8c0dcbf5f1c6ca5b0627c052cb16de8ce3f886).

I'm not trying to get this change upstream as-is. I'm mainly just posting it here for the benefit of others who are working on this issue. There's a good chance the above patch isn't the "right" fix, but it might help explain to someone who knows and understands the surrounding code what's going wrong. To that end, I included the entire function as patch context.
(In reply to Ray Strode [halfline] from comment #4)
> Created attachment 130977 [details] [review] [review]
> patch to aid in troubleshooting

I suppose this makes sense. Sort of. I was missing that even for array formats, the shift in the channel description is different on big endian, e.g. (from u_format_table.c):

const struct util_format_description
util_format_r32g32b32a32_float_description = {
   PIPE_FORMAT_R32G32B32A32_FLOAT,
   "PIPE_FORMAT_R32G32B32A32_FLOAT",
   "r32g32b32a32_float",
   {1, 1, 128}, /* block */
   UTIL_FORMAT_LAYOUT_PLAIN,
   4, /* nr_channels */
   TRUE, /* is_array */
   FALSE, /* is_bitmask */
   FALSE, /* is_mixed */
#ifdef PIPE_ARCH_BIG_ENDIAN
   {
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 96}, /* x = r */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 64}, /* y = g */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 32}, /* z = b */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 0}   /* w = a */
   },
#else
   {
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 0},  /* x = r */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 32}, /* y = g */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 64}, /* z = b */
      {UTIL_FORMAT_TYPE_FLOAT, FALSE, FALSE, 32, 96}  /* w = a */
   },
#endif
   ...

So maybe instead of chan_desc.shift / type.width, on big endian it needs to be (format_desc.block.bits - chan_desc.shift) / type.width. (Clearly you can't just use "i" since that would completely ignore ordering, albeit for most formats it should be the same.) There are several places where we handle channel desc shifts differently already in gallivm, albeit usually that's for within-dword shifts (e.g. 4x8 values packed).

I still don't quite understand it though - if we had accessed that as an ordinary array (as the format is an array format), then the addresses would effectively be all the same on big or little endian (as it's just 4 floats within an array), so I'm not quite sure why the shifts are actually different. But I'd better not think too much about it, it just gives me a headache.
(In reply to Roland Scheidegger from comment #5)
> So maybe instead of chan_desc.shift / type.width, on big endian it needs to
> be (format_desc.block.bits - chan_desc.shift) / type.width.

I guess that should be (format_desc.block.bits - (chan_desc.shift + chan_desc.size)) / type.width? I'll give that a try and post the piglit results here.
Created attachment 130980 [details] [review] the patch i'm running through piglit now
With attachment 130980 [details] [review], things are mostly working now. piglit found a couple regressions: https://people.freedesktop.org/~halfline/results-summary-shift-change/regressions.html
Hard-coding vector_justify to TRUE fixes the draw-vertices

  Short vertices - components: 3, stride: 6,
  Short vertices - components: 2, stride: 8,

tests.
(In reply to Ray Strode from comment #8)
> With attachment 130980 [details] [review] [review], things are mostly
> working now. piglit found a couple regressions:
>
> https://people.freedesktop.org/~halfline/results-summary-shift-change/
> regressions.html

Sure looks like it's all the same problem - the 3x16bit formats (PIPE_FORMAT_R16G16B16_X) don't work. Might be because they are handled differently by lp_build_gather(): 2x16 and 4x16 are loaded as exactly that, whereas 3x16 will load as 1x48 plus a zext to 64bit (vector_justify would add a shift). (The same isn't true for 3x32, which is loaded as 3x32 in lp_build_gather() and then the vector padded.)

I'm not sure - maybe loading as 3x16 plus vector padding, analogous to what is done for 3x32, would work? The code for doing so should work; it's simply not done this way because codegen on x86 produces a (albeit correct) mess - but it might fare better on ppc... Otherwise some vector_justify in the right cases might fix this too...

That said, the lp_build_gather() calls are supposed to give the same results regardless of whether things are fetched as vectors or scalars - this is meant purely for optimization purposes. And I've got some feeling that's not really the case on BE (if it isn't, that code is very likely to break again in the future).
(In reply to Ray Strode from comment #9)
> hard coding vector_justify to TRUE fixes the draw-vertices:
>
> Short vertices - components: 3, stride: 6,
> Short vertices - components: 2, stride: 8,
>
> tests.

I only see failures with the 3-component formats. Half floats though might require more fixes, as they have special conversion - not sure. There actually seem to be a fair number of fixes related to texformats tests compared to the old run, so it's definitely moving in the right direction, albeit not quite there yet. The regressions look mostly harmless in comparison to me, since there are so many formats still failing (formats with npot-sized blocks always require special handling everywhere...).
(In reply to Roland Scheidegger from comment #11) > > Short vertices - components: 2, stride: 8, > I only see failures with the 3-component formats. Indeed, I may have copy and pasted the wrong line there or something.
Created attachment 131000 [details] [review]
patch that didn't help at all

Hi,

(In reply to Roland Scheidegger from comment #10)
> I'm not sure maybe loading as 3x16 plus vector padding, analogous to what is
> done for 3x32 would work? The code for doing so should work, it's purely not
> done this way because codegen on x86 produces a (albeit correct) mess - but
> it might fare better on ppc...

So to be clear, I don't really grok the code. I did have a go at what I thought you meant, using the quick-and-dirty proof-of-concept patch above, but it failed in the same way as doing the scalar fetch.
(In reply to Ray Strode [halfline] from comment #13)
> Created attachment 131000 [details] [review] [review]
> patch that didn't help at all
>
> Hi,
> (In reply to Roland Scheidegger from comment #10)
> > I'm not sure maybe loading as 3x16 plus vector padding, analogous to what is
> > done for 3x32 would work? The code for doing so should work, it's purely not
> > done this way because codegen on x86 produces a (albeit correct) mess - but
> > it might fare better on ppc...
> So to be clear, I don't really grok the code. I did have a go at what I
> thought you meant, using the quick-and-dirty proof-of-concept patch above
> but it failed in the same way as doing the scalar fetch.

This looks good to me. I actually thought the code would be easily switchable, but I suppose I nuked that when I noticed there's no point in doing 3x16bit fetches on x86... (The fetch_dst_type.length adjustment seems unnecessary but harmless.)

So, if that doesn't help, this at least suggests loading 3x16 plus pad is the same as loading 1x48 plus zext. This is actually good news, as it would be really confusing otherwise... So it just needs a vector_justify somewhere.
This isn't really an OpenSWR (Drivers/Gallium/swr) specific problem. Is there another component that it should be moved to?
(In reply to Bruce Cherniak from comment #15) > This isn't really an OpenSWR (Drivers/Gallium/swr) specific problem. Is > there another component that it should be moved to? We don't really have an appropriate component... I think such issues often end up with either "Mesa core" or "Other", the former isn't quite right and the latter of course not all that helpful...
Created attachment 131171 [details] [review]
another patch that may aid in understanding what's going on

Just to follow up again: we played around a bit more and figured some stuff out, but we're still kind of banging around in the dark. What we learned:

- The reason vector_justify = TRUE fixed one of the tests but kept some failing is that it was reversing the X and Y coordinates, so only the first 3-component sshort test run was getting placed properly (because the test case has x and y start with the same value). The subsequent ones were growing down instead of right.

- Of course we can then get it working by changing the swizzle order in u_format.csv to yxz1 for the 3-component short and 3-component half-float tests. Clearly not the "right" fix, but it was interesting to see all the tests working.

- If I leave vector_justify at FALSE but change attachment 130980 [details] [review] to use fetch_width instead of format_desc->block.bits, then all the 3-component sshort tests and half-float tests start working, but then r32g32b32_sscaled starts failing.

So, the above patch fudges it to make everything in draw-vertices and draw-vertices-half-float work. Obviously not shippable, but it might aid in figuring out what's going on?
Comment on attachment 131171 [details] [review] another patch that may aid in understanding what's going on oops attached the wrong patch.
Comment on attachment 131171 [details] [review] another patch that may aid in understanding what's going on err no, nevermind it's the right one, too many browser tabs open.
Hi,

> - if i leave vector_justify to FALSE, but change attachment 130980 [details] [review]
>   to use fetch_width instead of format_desc->block.bits then all the
>   3 component sshort tests and half float tests start working, but then
>   r32g32b32_sscaled start failing. So, the above patch fudges it to make
>   everything in draw-vertices and draw-vertices-half-float work.

Just to expand on this... If we comment out the 3x32bit vector fetch special case alluded to in comment 10, and instead rely on a 96bit scalar fetch (just as we rely on a 48bit scalar fetch for r16g16b16), then using fetch_width instead of format_desc->block.bits in the vec_nr computation works for the 32bit case, too. So, Roland, your first intuition appears to be right: the scalar and vector paths are different. Probably attachment 131000 [details] [review] is wrong, or I mucked up the testing, or something. But, of course, we should get the scalar case working regardless...

For 3 r16g16b16 (well, x16y16z16) vertices:

short v3[] = { x1, y1, z1,
               x1, y2, z2,
               x2, y1, z3 };

a scalar 48bit fetch leaves things like this:

packed[0] = [ pad x1 y1 z1 pad x1 y2 z2 ]
packed[1] = [ pad x2 y1 z3 pad x2 y1 z3 ]

(where pad means zeros from the zero-extend-to-64bits op) which then get grouped into 32-bit quantities, merged and reordered like so:

dst[0] = [ [pad x1] [pad x1] [pad x2] [pad x2] ]
dst[1] = [ [ y1 z1] [ y2 z2] [ y1 z3] [ y1 z3] ]

In this layout, vec_nr needs to be 0, then 1, then 1 to get the correct values, which is achieved with

vec_nr = (fetch_width - (chan_desc.shift + chan_desc.size)) / type.width;

Of course, I think we want to unconditionally (on big endian) use:

vec_nr = (format_desc->block.bits - (chan_desc.shift + chan_desc.size)) / type.width;

instead, which leaves vec_nr as 0, then 0, then 1. In order for that to work, it requires dst[0] to look like:

dst[0] = [ [y1 x1] [y2 x1] [y1 x2] [y1 x2] ]

We almost get there by using vector_justify = TRUE.
it leaves us with:

packed[0] = [ x1 y1 z1 pad x1 y2 z2 pad ]
packed[1] = [ x2 y1 z3 pad x2 y1 z3 pad ]

dst[0] = [ [x1 y1] [x1 y2] [x2 y1] [x2 y1] ]

So, when vector_justify is TRUE, we end up with almost what we want. Just the x and y coordinates are sitting in each 32-bit word in reverse. (I'm omitting dst[1] for brevity, but it works out that we need to byteswap those, too.) This is why adding the u_format.csv big-endian entry with the very odd-looking swizzle format yxz1 worked: it's letting the chan swizzle at the end do the swapping for us. I don't think it's a good idea to rely on that, though. So I guess in the cases where we do a scalar fetch we need to justify and swap?

I guess it's somewhat analogous to

short foo[2] = { 0x1111, 0x2222 };
int32_t bar = *(int32_t *) foo;

where you normally would need to swap on big endian if you want to keep 0x1111 first/least significant.
Unless I'm misunderstanding something, I think this comment in u_format.h is the crux of the issue:

 * If each channel is accessed as an individual N-byte value, X is always
 * at the lowest address in memory, Y is always next, and so on. For all
 * currently-defined formats, the N-byte value has native endianness.
 *
 * If instead a group of channels is accessed as a single N-byte value,
 * the order of the channels within that value depends on endianness.
 * For big-endian targets, X is the most significant subvalue,
 * otherwise it is the least significant one.

I guess vector fetch is the first paragraph, and scalar fetch is the second paragraph. So they can't behave the same on big endian unless we introduce swaps.
(In reply to Ray Strode [halfline] from comment #21)
> Unless I'm misunderstanding something, I think this comment in u_format.h is
> the crux of the issue:
>
>  * If each channel is accessed as an individual N-byte value, X is always
>  * at the lowest address in memory, Y is always next, and so on. For all
>  * currently-defined formats, the N-byte value has native endianness.
>  *
>  * If instead a group of channels is accessed as a single N-byte value,
>  * the order of the channels within that value depends on endianness.
>  * For big-endian targets, X is the most significant subvalue,
>  * otherwise it is the least significant one.
>
> I guess vector fetch is the first paragraph, and scalar fetch is the second
> paragraph. So they can't behave the same on big endian unless we introduce
> swaps.

I think this has more to do with packed formats - for things like 10/10/10/2 you can't really access that as multiple individual bytes. But yes, I suppose it's that way because scalar and vector fetches aren't the same. Maybe the logic should be simplified for SoA fetch for big endian (e.g. always use scalars or vectors). Whatever works (well, ideally you'd use what generates better code, but I have no idea there on big-endian or specifically ppc boxes).
(In reply to Ray Strode from comment #20)
> in this layout, vec_nr needs to be 0, then 1, then 1 to get the correct
> values, which is achieved with
>
> vec_nr = (fetch_width - (chan_desc.shift + chan_desc.size)) / type.width;
>
> Of course, I think we want to unconditionally (on big endian) use:
>
> vec_nr = (format_desc->block.bits - (chan_desc.shift + chan_desc.size)) /
> type.width;

Well, it doesn't really matter, I suppose. Albeit it's probably easier to understand if the first elements end up in the first vector.

> instead, which leaves vec_nr as 0, then 0, then 1. In order for that to
> work, it requires dst[0] to look like:
>
> dst[0] = [ [y1 x1] [y2 x1] [y1 x2] [y1 x2] ]
>
> We almost get there by using vector_justify = TRUE. it leaves us with:
>
> packed[0] = [ x1 y1 z1 pad x1 y2 z2 pad ]
> packed[1] = [ x2 y1 z3 pad x2 y1 z3 pad ]
>
> dst[0] = [ [x1 y1] [x1 y2] [x2 y1] [x2 y1] ]
>
> So, when vector_justify is TRUE, we end up with almost what we want. Just
> the x and y coordinates are sitting in each 32 bit word in reverse. (i'm
> omitting dst[1] for brevity, but it works out to byteswap those, too). This
> is why adding the u_format.csv big endian entry for the very odd looking
> swizzle format yxz1 worked. It's letting the chan swizzle at the end do the
> swapping for us. I don't think it's a good idea to rely on that though. So
> I guess in the cases we do a scalar fetch we need to justify and swap ?
>
> I guess it's somewhat analogous to say
>
> short foo[2] = { 0x1111, 0x2222 };
> int32_t bar = *(int32_t *) foo;
>
> where you normally would need to swap on bigendian if you want to keep
> 0x1111 first/least significant.

Yes, that would make sense. Albeit we really want to do something which makes sense performance-wise (so we don't want to do some load where the cpu has to do extra work just so the result ends up in a wrong order that we then need to swizzle some more), which I can't quite tell.
But if things are simpler with always using scalar loads, for instance, we could just do that on BE.
(mid-air collision)

(In reply to Roland Scheidegger from comment #22)
> I think this has more to do with packed formats - for things like 10/10/10/2
> you can't really access that as multiple individual bytes.

Ah, yea - actually if I scroll down a little in the editor I see it gives an example of each case (8/24 for the first case, and 5/5/5/1 for the second case).

> Maybe the logic should be simplified for soa fetch for big-endian (e.g.
> always use scalars or vectors). Whatever works (well ideally you'd use what
> generates better code but I have no idea there on big endian or specifically
> ppc boxes.)

Yea, I guess thinking about it more, even if we can get scalar fetch to work with sufficient twiddling, that twiddling probably introduces extra operations per element, so maybe not a good idea. I guess we should take another crack at attachment 131000 [details] [review] first.
(In reply to Ray Strode from comment #24)
...
> Yea, I guess thinking about it more, even if we can get scalar fetch to work
> with sufficient twiddling, that twiddling probably introduces extra
> operations per element, so maybe not a good idea. I guess we should take
> another crack at attachment 131000 [details] [review] [review] first.

I have been looking at the assembly code for the 3x16 case generated on both big- and little-endian machines. This case stems from the piglit/tests/general/draw-vertices:test_short_vertices function.

First, the LLVM IR I'm focusing on is the IR generated by lp_build_gather_elem_vec (called by lp_build_gather, called by lp_build_fetch_rgba_soa, called by fetch_vector...); the IR looks like this:

%"lp_build_gather_elem_ptr:72.21" = extractelement <2 x i32> %"lp_build_fetch_rgba_soa:557.", i32 0
%"lp_build_gather_elem_ptr:75.22" = getelementptr i8, i8* %map_ptr, i32 %"lp_build_gather_elem_ptr:72.21"
%"lp_build_gather_elem_vec:189.23" = bitcast i8* %"lp_build_gather_elem_ptr:75.22" to i48*
%"lp_build_gather_elem_vec:190.24" = load i48, i48* %"lp_build_gather_elem_vec:189.23", align 1

where I've used the last parameter to the LLVMBuild* calls - the parameter that appears as a null string in the production code - to contain the function name and the line number, which end up getting inserted in LLVM's result names.
If you prefer the IR without the debug info, here it is:

%21 = extractelement <2 x i32> %0, i32 0
%22 = getelementptr i8, i8* %map_ptr, i32 %21
%23 = bitcast i8* %22 to i48*
%24 = load i48, i48* %23, align 1

I've modified the data in the draw-vertices program so the values stick out in the registers; the data I'm using for this example looks like this:

(gdb) x/6h $r4
0x1029ca00: 0x0015 0x0011 0x1111 0x0015 0x0016 0x2222
              X1     Y1     Z1     X1     Y2     Z2

In general, PPC assembly code is a three-operand code where the syntax is (usually)

OPCODE target, src1, src2

Load/store syntax is

LOAD    RT, immediate-displacement(RA)
LOADx   RT, RA, RB   ;; effective addr is RA + RB (LOADx = load indexed)
LOADux  RT, RA, RB   ;; effective addr is RA + RB (LOADux = load indexed w/ update; RT <- MEM(RA + RB) AND RA <- RA + RB)
STORE   RS, immediate-displacement(RA)
STOREx  RS, RA, RB   ;; effective addr is RA + RB (STOREx = store indexed)
STOREux RS, RA, RB   ;; effective addr is RA + RB (STOREux = store indexed w/ update)

Both the LOAD and the LOADux variants appear below.

The Rotate instructions have a more complex syntax with four operands. The target is on the left, as usual; the source operands are:

. the source register;
. the (immediate) number of bits to shift;
. an (immediate) mask specification M, with the semantics "AND the penultimate result with a mask consisting of bits 0:M = 1, M+1:63 = 0"

Please note that these descriptions are over-simplified. It is important to note that, whether the machine is little-endian or big-endian, BITS IN A REGISTER ARE NUMBERED FROM THE LEFT. I.e., the most significant bit is bit 0, ... the least significant bit is bit 63.
On a (little-endian) PPC64LE machine, the assembly code looks like this:

=> 0x3fffab370404: lwzux  r3,r4,r3    ;; Load Word w/ zero-extend & update
                                      ;; r3 <- 0x11.0015 = Y.X
   0x3fffab370408: lhz    r4,4(r4)    ;; Load halfword w/ zero-extend
                                      ;; r4 <- 0x1111 = Z
   0x3fffab37040c: std    r2,24(r1)
   0x3fffab370410: rldicr r4,r4,32,31 ;; Rotate Left doubleword immediate & clear right;
                                      ;; imm = 32, mask = 0:31 => r4 <- 0x1111.0000.0000 = 0.Z.0.0
   0x3fffab370414: or     r25,r3,r4   ;; r25 <- 0x1111.0011.0015 = Z.Y.X

So, the operation of loading a 48-bit int corresponds well with loading the 3-vector of int16s into the 64-bit target register.

On a big-endian PPC64 machine, the assembly code looks like this:

   0x3fffaace0538: lwzux  r3,r4,r3    ;; r3 = 0, r4 = 0x10273d80; r3 <- 0x15.0011, i.e. X.Y
   0x3fffaace053c: lhz    r4,4(r4)    ;; r4 <- 0x1111 = Z
   0x3fffaace0540: rldicr r3,r3,16,47 ;; Rotate Left doubleword immediate & clear right;
                                      ;; imm = 16, mask = 0:47 => r3 <- 0x15.0011.0000, i.e. r3 <<= 16 = 0.X.Y.0
   ...
   0x3fffaace0548: or     r24,r4,r3   ;; r24 <- 0x15.0011.1111, i.e. 0.X.Y.Z

Note that no single operation - shift, justification, "zero-extend" or anything else - can get the 16-bit fields into the proper order for subsequent code.

Regarding Ray's specific comment about getting scalar fetch to work with "sufficient twiddling," I think it's perfectly acceptable to introduce extra operations, as long as we restrict the extra operations to the big-endian path. PPC64 (LE or BE) is fast enough that any performance impact will be negligible; S390 is less fast, but I imagine production machines with more memory than the one we experimented on here are fast enough. And remember that this code is going into the vertex shader program, where there are already performance hurdles like branches.
(In reply to Ben Crocker from comment #25) > > Regarding Ray's specific comment about getting scalar fetch to work > with "sufficient twiddling," I think it's perfectly acceptable to > introduce extra operations, as long as we restrict the extra > operations to the big-endian path. PPC64 (LE or BE) is fast enough so > that any performance impact will be negligible; S390 is less fast, but > I imagine production machines with more memory than the one we > experimented on here are fast enough. drive-by comment.. unless llvm is just rubbish at optimization, I don't think saving a few operations in the front-end IR building should be that important, even for LE. But we have shader-db so it should be possible to prove/disprove that theory. (Not sure if llvmpipe is instrumented for shader-db but if not that should be easy to solve.)
*** Bug 101211 has been marked as a duplicate of this bug. ***
The attached patch to lp_build_gather_elem_vec fixes the regressions in, e.g., the draw-vertices 16-bit tests. I'll spare you the assembly code this time, but suffice it to say the code rearranges the 0.X.Y.Z subvalues in the GPR to the desired order via suitable Bitcast and ShuffleVector operations. In fact, the code for the load operation starts off identically; the (new) code that shuffles the values into the correct positions occurs much later and, in fact, operates on two vertices at once!
Created attachment 131577 [details] [review] Patch to lp_build_gather_elem_vec THIS is the attachment I meant to attach to Comment 28.
(In reply to Rob Clark from comment #26)
> (In reply to Ben Crocker from comment #25)
> >
> > Regarding Ray's specific comment about getting scalar fetch to work
> > with "sufficient twiddling," I think it's perfectly acceptable to
> > introduce extra operations, as long as we restrict the extra
> > operations to the big-endian path. PPC64 (LE or BE) is fast enough so
> > that any performance impact will be negligible; S390 is less fast, but
> > I imagine production machines with more memory than the one we
> > experimented on here are fast enough.
>
> drive-by comment.. unless llvm is just rubbish at optimization, I don't
> think saving a few operations in the front-end IR building should be that
> important, even for LE.

Well, yes and no. Yes, if it makes things conceptually simpler (which probably isn't really the case here). I'm not sure how good llvm is there with the ppc backend. But for x86, no, you can't assume optimization will take care of everything neatly, in particular for load/shuffle combinations. If you look at it, there is in fact lots of hack code around gathering of values (on x86), simply because llvm can't do some kinds of optimizations. In particular, it can't do any optimizations crossing scalar/vector boundaries, so whether you zero-extend values after a scalar load or after assembling them into vectors makes a large difference in generated code quality. And if you use an int load, llvm will not consider using float shuffles afterwards, even if it means using 3 shuffle instructions instead of just 1, and so on (llvm has no real model of domain transition penalty costs, which don't exist in these cases on most cpus) - albeit that latter problem has been fixed with llvm 4.0.

However, I would not expect these particular bits to be a problem on non-x86 cpus. I think int/float issues are x86 specific (other simd instruction sets afaik don't tend to have different int/float load/store, shuffle or even logic op operations).
So, going for the conceptually simplest solution should be alright (albeit, for instance, the scalar/vector "optimization barrier" is probably going to affect all backends).

> But we have shader-db so it should be possible to
> prove/disprove that theory. (Not sure if llvmpipe is instrumented for
> shader-db but if not that should be easy to solve.)

Yeah, I suppose I should really do that at some point...
(In reply to Roland Scheidegger from comment #30)
> (In reply to Rob Clark from comment #26)
> > (In reply to Ben Crocker from comment #25)
> > >
> > > Regarding Ray's specific comment about getting scalar fetch to work
> > > with "sufficient twiddling," I think it's perfectly acceptable to
> > > introduce extra operations, as long as we restrict the extra
> > > operations to the big-endian path. PPC64 (LE or BE) is fast enough so
> > > that any performance impact will be negligible; S390 is less fast, but
> > > I imagine production machines with more memory than the one we
> > > experimented on here are fast enough.
> >
> > drive-by comment.. unless llvm is just rubbish at optimization, I don't
> > think saving a few operations in the front-end IR building should be that
> > important, even for LE.
>
> Well, yes and no. Yes, if it makes things conceptually simpler (which
> probably isn't really the case here).
> I'm not sure how good llvm is there with the ppc backend. But for x86, no
> you can't assume optimization will take care of everything neatly, in
> particular for load/shuffle combinations. If you look at it, there is in
> fact lots of hack code around gathering of values (on x86), simply because
> llvm can't do some kind of optimizations (in particular, it can't do any
> optimizations crossing scalar/vector boundaries, so if you zero-extend
> values after a scalar load or after assembling into vectors makes a large
> difference in generated code quality, or if you use a int load llvm will not
> consider using float shuffles afterwards even if it means using 3 shuffle
> instructions instead of just 1 and so on (llvm has no real model of domain
> transition penalty costs, which don't exist in these cases on most cpus),
> albeit that latter problem has been fixed with llvm 4.0).
> However, I would not expect these particular bits to be a problem on non-x86
> cpus. I think int/float issues are x86 specific (other simd instruction
> sets afaik don't tend to have different int/float load/store, shuffle or
> even logic op operations). So, going for the conceptually simplest solution
> should be alright (albeit for instance the scalar/vector "optimization
> barrier" is probably going to affect all backends).
>
> > But we have shader-db so it should be possible to
> > prove/disprove that theory. (Not sure if llvmpipe is instrumented for
> > shader-db but if not that should be easy to solve.)
>
> Yeah I suppose should really do that at some point...

I want to emphasize at this point that the patch I described in Comments 28-29 is compile-time conditionalized for big-endian only.
(In reply to Ben Crocker from comment #29)
> Created attachment 131577 [details] [review]
> Patch to lp_build_gather_elem_vec
>
> THIS is the attachment I meant to attach to Comment 28.

Yes, that looks like it could work. Albeit the code in gather really ought to work too if we'd decided to fetch 4x16 values the same way as 3x16 (so, as a single 64bit value instead of 2x32), and I don't think that's the case.

In any case, you should send any patches you think are ready to mesa-dev. (This particular patch would need some comments, plus you should use a helper for constructing the type16 type; I don't think we ever initialize those directly.)

I think, though, that overall this is still all quite messy. (I'd be fine with always doing either vector-as-scalar or vector-as-vector fetch on BE, both for the soa format code and the gather code, and not both, if it means I'm going to break those paths less often...)
(In reply to Stefan Dirsch from comment #0)
> I've observed a regression in Mesa 17 on s390x (zSystems). It's hard to
> describe. Here is a video on Youtube.
>
> https://youtu.be/xaYauE0Kn8E
>
> On the left side glxgears is running on x86_64, on the right side it's
> running on s390x (zSystems) - a 64bit bigendian platform. gdm and gnome-shell
> show a black screen. I haven't screenshotted that though (since it's boring).
>
> I git bisected Mesa and the culprit patch was
>
> From e827d9175675aaa6cfc0b981e2a80685fb7b3a74 Mon Sep 17 00:00:00 2001
> From: Roland Scheidegger <sroland@vmware.com>
> Date: Wed, 21 Dec 2016 04:43:07 +0100
> Subject: [PATCH] draw: use SoA fetch, not AoS one
>
> Now that there's some SoA fetch which never falls back, we should always get
> results which are better or at least not worse (something like rgba32f will
> stay the same).
>
> When reverse-applying this patch on Mesa 17.0.3 the regression on s390x gets
> fixed.
>
> Mesa has been compiled with
>
> ./configure --host=s390x-ibm-linux-gnu --build=s390x-ibm-linux-gnu
> --program-prefix= --disable-dependency-tracking --prefix=/usr
> --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
> --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
> --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/usr/com
> --mandir=/usr/share/man --infodir=/usr/share/info
> --disable-dependency-tracking --enable-gles1 --enable-gles2 --enable-dri
> --with-egl-platforms=x11,drm --enable-shared-glapi --enable-texture-float
> --enable-osmesa --enable-dri3 --enable-shader-cache --enable-gbm
> --enable-glx-tls --with-dri-searchpath=/usr/lib64/dri --enable-gallium-llvm
> --enable-llvm-shared-libs --enable-vdpau --enable-va --enable-xvmc
> --with-dri-drivers=swrast --with-gallium-drivers=swrast
> 'CFLAGS=-fmessage-length=0 -grecord-gcc-switches -fstack-protector -O2 -Wall
> -D_FORTIFY_SOURCE=2 -funwind-tables -fasynchronous-unwind-tables -g -DNDEBUG'

Hi Stefan, sorry for asking this small off-topic question: how is the situation there with RadeonSi Mesa EGL? Here on QorIQ and on PowerMac G5, glamor crashes if egl_radeonsi is invoked, making Xorg unusable. Is the situation the same there?
(In reply to intermediadc@hotmail.com from comment #33)
> Hi Stefan, sorry for asking this small off-topic question: how is the
> situation there with RadeonSi Mesa EGL? Here on QorIQ and on PowerMac G5,
> glamor crashes if egl_radeonsi is invoked, making Xorg unusable. Is the
> situation the same there?

I haven't heard of such issues yet.
Just for information, here is my story, with Michel Dänzer's reply:

https://bugs.freedesktop.org/show_bug.cgi?id=99859#c19

But there are many posts about many big-endian hardware platforms; I was only curious whether s390 has the same issue or not.

Thanks,
Luigi
(In reply to intermediadc@hotmail.com from comment #35)
> Just for information, here is my story, with Michel Dänzer's reply:
> https://bugs.freedesktop.org/show_bug.cgi?id=99859#c19
>
> But there are many posts about many big-endian hardware platforms; I was
> only curious whether s390 has the same issue or not.
>
> Thanks,
> Luigi

So far, at least on this bug (or cluster of bugs), we have seen the same behavior on both S390 and big-endian Power8 (PPC64).
(In reply to intermediadc@hotmail.com from comment #35)
> https://bugs.freedesktop.org/show_bug.cgi?id=99859#c19

The radeonsi driver is known not to work yet on big endian hosts.

Is there anything unclear about this statement?
(In reply to intermediadc@hotmail.com from comment #35)
> Just for information, here is my story, with Michel Dänzer's reply:
> https://bugs.freedesktop.org/show_bug.cgi?id=99859#c19
>
> But there are many posts about many big-endian hardware platforms; I was
> only curious whether s390 has the same issue or not.

Well, s390x comes without any graphics hardware. And that's the only BE platform SUSE supports with our current enterprise product (where we use Mesa 17). We also support ppc64le, but that is obviously LE, unlike ppc64.
(In reply to Ben Crocker from comment #36)
> So far, at least on this bug (or cluster of bugs), we have seen the
> same behavior on both S390 and big-endian Power8 (PPC64).

Power8 is ppc64le, isn't it? Confused ...
(In reply to Stefan Dirsch from comment #39)
> (In reply to Ben Crocker from comment #36)
> > So far, at least on this bug (or cluster of bugs), we have seen the
> > same behavior on both S390 and big-endian Power8 (PPC64).
>
> Power8 is ppc64le, isn't it? Confused ...

Power8 can operate in either little-endian (PPC64LE) or big-endian (PPC64) mode. RHEL, Fedora, etc. support both. I've been focusing on Power8 because: 1) it's what I'm more familiar with; 2) we mostly see the same problems on PPC64(BE) and S390.
(In reply to Michel Dänzer from comment #37)
> (In reply to intermediadc@hotmail.com from comment #35)
> > https://bugs.freedesktop.org/show_bug.cgi?id=99859#c19
>
> The radeonsi driver is known not to work yet on big endian hosts.
>
> Is there anything unclear about this statement?

Hi Michel, as a big-endian user I sometimes see something working on G5 and not working on QorIQ, and vice versa, or working on G4 (Mac mini, CHRP, AmigaOne XE) and not running on G5 (given the different hardware architectures, ISAs, and so on). I was only curious whether this SI issue was present on hardware other than mine. Just curiosity, nothing more; I and many others are hoping it will be fixed one day...
Created attachment 133677 [details] [review]
lp_build_gather_elem_vec big-endian fix for 3x16 load

In reply to Roland's Comment 32:

Roland, thanks for the constructive feedback. Here is what the patch looks like now.

I don't disagree that this is messy, but it DOES resolve (most of) the Piglit regressions.

I agree that the code in lp_bld_gather should work for 4x16 bit vectors, but SOMETHING appears to have gone right already for 4x16 bit vectors, as the only regressions seen had to do with 3x16 vectors.
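The 3x16-versus-4x16 distinction comes down to how the components land when the attribute is pulled in with differently sized loads on a big-endian machine. A small Python sketch (an endianness illustration only, not the actual lp_bld_gather code; the component values are made up):

```python
import struct

# Three 16-bit vertex components (e.g. an R16G16B16 attribute) stored
# contiguously in memory on a big-endian machine, padded out to 8 bytes.
v0, v1, v2 = 0x1111, 0x2222, 0x3333
mem = struct.pack(">3H", v0, v1, v2) + b"\x00\x00"

# Strategy A: fetch the whole attribute with one 64-bit load.
# On BE the first component lands in the most-significant 16 bits,
# so element i is recovered by shifting down from the top.
word64 = struct.unpack(">Q", mem)[0]
elems_a = [(word64 >> (48 - 16 * i)) & 0xFFFF for i in range(3)]

# Strategy B: fetch it as two 32-bit loads instead.
# Within each 32-bit word the earlier component again sits in the
# high half, so the per-element shift pattern differs from strategy A.
lo, hi = struct.unpack(">2I", mem)
elems_b = [(lo >> 16) & 0xFFFF, lo & 0xFFFF, (hi >> 16) & 0xFFFF]

# Both strategies recover the same components, but only if each one
# applies its own BE-specific justification; reusing strategy A's
# shifts on strategy B's words (or vice versa) scrambles the result.
assert elems_a == elems_b == [v0, v1, v2]
```

This is why code paths that fetch 3x16 as a single 64-bit value and 4x16 as 2x32 need separate big-endian handling for each shape.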
(In reply to Ben Crocker from comment #42)
> Created attachment 133677 [details] [review]
> lp_build_gather_elem_vec big-endian fix for 3x16 load
>
> In reply to Roland's Comment 32:
>
> Roland, thanks for the constructive feedback. Here is what the patch
> looks like now.
>
> I don't disagree that this is messy, but it DOES resolve (most of) the
> Piglit regressions.
>
> I agree that the code in lp_bld_gather should work for 4x16 bit
> vectors, but SOMETHING appears to have gone right already for 4x16 bit
> vectors, as the only regressions seen had to do with 3x16 vectors.

Clarification: the only regressions seen AFTER Ray Strode's patch (attachment 130980 [details] [review]).
(In reply to Ben Crocker from comment #43)
> (In reply to Ben Crocker from comment #42)
> > Created attachment 133677 [details] [review]
> > lp_build_gather_elem_vec big-endian fix for 3x16 load
> >
> > In reply to Roland's Comment 32:
> >
> > Roland, thanks for the constructive feedback. Here is what the patch
> > looks like now.
> >
> > I don't disagree that this is messy, but it DOES resolve (most of) the
> > Piglit regressions.
> >
> > I agree that the code in lp_bld_gather should work for 4x16 bit
> > vectors, but SOMETHING appears to have gone right already for 4x16 bit
> > vectors, as the only regressions seen had to do with 3x16 vectors.
>
> Clarification: the only regressions seen AFTER Ray Strode's patch
> (attachment 130980 [details] [review]).

By the way, draw-vertices-2101010 is failing across all architectures: X86 and PPC64LE as well as PPC64/S390x.
(In reply to Ben Crocker from comment #44)
> By the way, draw-vertices-2101010 is failing across all architectures:
> X86 and PPC64LE as well as PPC64/S390x.

Yes, but this isn't related to swizzling (at least on x86, unless ppc64 has additional errors); rather, it is due to GL weirdness when converting signed normalized numbers. Pre-GL-4.2, there are two different formulas depending on whether the value is used as a vertex format or a pixel format, and the test expects the result of the formula used for vertex formats. GL 4.2 and up ditched this formula entirely (obviously, two formulas don't make all that much sense if pixel and vertex data all live in the same buffers anyway...), and that's the one we always use, no matter the GL version.

The difference between the formulas usually isn't all that large, but with just 2 bits the error is huge: the four representable values are [-1.0, -0.333, 0.333, 1.0] for the old formula and [-1.0, -1.0, 0.0, 1.0] for the new one, hence the failure (Expected: ... 0.333000, Observed: ... 0.000000).

Note that at least some hw/drivers will ignore these old GL rules too (after all, GL 4.2 is supposed to be backwards compatible, and clearly with the different formula gone it can't quite be), therefore we don't consider this a driver bug, but rather a test bug: https://bugs.freedesktop.org/show_bug.cgi?id=59150
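The two conversion rules described above can be checked numerically. A small Python sketch (the function names are made up; the formulas are the standard signed-normalized conversions, with `bits` the component width):

```python
def snorm_to_float_old(c, bits):
    # Pre-GL-4.2 vertex-format rule: f = (2c + 1) / (2^b - 1)
    return (2.0 * c + 1.0) / (2.0 ** bits - 1.0)

def snorm_to_float_new(c, bits):
    # GL 4.2+ rule: f = max(c / (2^(b-1) - 1), -1.0)
    return max(c / (2.0 ** (bits - 1) - 1.0), -1.0)

# The signed 2-bit component of a 2_10_10_10 vertex format can hold
# c in {-2, -1, 0, 1}; compare what the two rules make of it.
old = [round(snorm_to_float_old(c, 2), 3) for c in (-2, -1, 0, 1)]
new = [round(snorm_to_float_new(c, 2), 3) for c in (-2, -1, 0, 1)]
assert old == [-1.0, -0.333, 0.333, 1.0]
assert new == [-1.0, -1.0, 0.0, 1.0]
```

With wider components (10 or 16 bits) the two rules differ by at most one part in 2^(b-1), which is why the divergence only becomes glaring for the 2-bit channel.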
Hi, I have to note that this Mesa issue is present inside QEMU ppc/ppc64 VMs too, if std VGA is used to render the graphics. The result is that some window managers crash on distros whose default Mesa build uses swrast with llvm; the same goes for many applications. The test distro is Fedora Server 26 PPC64 in QEMU 2.8.1.

Luigi
Stefan, we had a number of fixes in Mesa 17.2.1 and later that reference this bug. Is it safe to assume the original regression is resolved? Should we close this?
Emil, thanks for asking. Indeed, we've removed all these patches with Mesa 17.2.1. Therefore closing as fixed.