Citra (citra-emu) renders incorrectly on Ivy Bridge and Haswell. llvmpipe, Skylake, and Kaby Lake are verified to be unaffected. I've tested 17.3.6 and git master.

Broken screenshot: http://markus.members.selfnet.de/dolphin/citra0000079859.png
apitrace dump: http://markus.members.selfnet.de/dolphin/citra-qt.trace

The first broken draw call is frame 71, call 7343. I'm sorry for the vertex shader :/
Issue is reproducible on Haswell:

OS: Ubuntu 16.04 LTS 64-bit
CPU: Intel® Core™ i5-4310M CPU @ 2.70GHz × 4
GPU: Intel® Haswell Mobile
mesa: OpenGL ES 3.2 Mesa 18.1.0-devel (git-fb5825e7ce)
kernel: 4.13.0-26-generic

I also tried Mesa back to 12.1.0 and the issue is reproducible there.
Bisected:

2458ea95c5676807a064f24ec720f12506975402 is the first bad commit

commit 2458ea95c5676807a064f24ec720f12506975402
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Sep 9 14:40:06 2015 -0700

    nir/lower_vec_to_movs: Coalesce movs on-the-fly when possible

    The old pass blindly inserted a bunch of moves into the shader with no
    concern for whether or not it was really needed.  This adds code to try
    and coalesce into the destination of the instruction providing the value.

    Shader-db results for vec4 shaders on Haswell:

    total instructions in shared programs: 1754420 -> 1747753 (-0.38%)
    instructions in affected programs:     231230 -> 224563 (-2.88%)
    helped:                                1017
    HURT:                                  2

    This approach is heavily based on a different patch by Eduardo Lima
    Mitev <elima@igalia.com>.  Eduardo's patch did this in a separate pass
    as opposed to integrating it into nir_lower_vec_to_movs.

    Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>

The issue is related to the optimizations in the try_coalesce() function. With the following simple change the bug is no longer reproducible:

diff --git a/src/compiler/nir/nir_lower_vec_to_movs.c b/src/compiler/nir/nir_lower_vec_to_movs.c
index 711ddd3..73efa0e 100644
--- a/src/compiler/nir/nir_lower_vec_to_movs.c
+++ b/src/compiler/nir/nir_lower_vec_to_movs.c
@@ -263,8 +263,8 @@ lower_vec_to_movs_block(nir_block *block, nir_function_impl *impl)
          if (!(vec->dest.write_mask & (1 << i)))
             continue;

-         if (!(finished_write_mask & (1 << i)))
-            finished_write_mask |= try_coalesce(vec, i);
+         //if (!(finished_write_mask & (1 << i)))
+         //   finished_write_mask |= try_coalesce(vec, i);

          if (!(finished_write_mask & (1 << i)))
             finished_write_mask |= insert_mov(vec, i, shader);

Below are the shader-db results on Haswell with this patch:

total instructions in shared programs: 241498 -> 241599 (0.04%)
instructions in affected programs: 25728 -> 25829 (0.39%)
helped: 14
HURT: 29

total cycles in shared programs: 142924052 -> 143021242 (0.07%)
cycles in affected programs: 85446208 -> 85543398 (0.11%)
helped: 16
HURT: 64
I also ran the test with INTEL_DEBUG=vs both with the patch (comment 2) and without it, and got the following difference:

@@ -76,20 +76,16 @@ impl main {
        vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
        vec4 32 ssa_2 = intrinsic load_input (ssa_1) () (0, 0) /* base=0 */ /* component=0 */ /* in_0 */
        vec4 32 ssa_3 = intrinsic load_uniform (ssa_1) () (0, 64) /* base=0 */ /* range=64 */ /* parameters */
-       vec4 32 ssa_4 = fdot_replicated4 ssa_2, ssa_3
+       r4.x = fdot_replicated4 ssa_2, ssa_3
        vec1 32 ssa_5 = load_const (0x00000010 /* 0.000000 */)
        vec4 32 ssa_6 = intrinsic load_uniform (ssa_5) () (0, 64) /* base=0 */ /* range=64 */ /* parameters */
-       vec4 32 ssa_7 = fdot_replicated4 ssa_2, ssa_6
+       r4.y = fdot_replicated4 ssa_2, ssa_6
        vec1 32 ssa_8 = load_const (0x00000020 /* 0.000000 */)
        vec4 32 ssa_9 = intrinsic load_uniform (ssa_8) () (0, 64) /* base=0 */ /* range=64 */ /* parameters */
-       vec4 32 ssa_10 = fdot_replicated4 ssa_2, ssa_9
+       r4.z = fdot_replicated4 ssa_2, ssa_9
        vec1 32 ssa_11 = load_const (0x00000030 /* 0.000000 */)
        vec4 32 ssa_12 = intrinsic load_uniform (ssa_11) () (0, 64) /* base=0 */ /* range=64 */ /* parameters */
-       vec4 32 ssa_13 = fdot_replicated4 ssa_2, ssa_12
-       r4.x = imov ssa_4.x
-       r4.y = imov ssa_7.x
-       r4.z = imov ssa_10.x
-       r4.w = imov ssa_13.x
+       r4.w = fdot_replicated4 ssa_2, ssa_12
        vec4 32 ssa_15 = intrinsic load_input (ssa_1) () (1, 0) /* base=1 */ /* component=0 */
        vec4 32 ssa_16 = intrinsic load_input (ssa_1) () (2, 0) /* base=2 */ /* component=0 */ /* in_2 */
        r5.x = imov ssa_16.x
The NIR change looks fine to me. Can you attach full INTEL_DEBUG=vs dumps with the back-end IR as well as the NIR for both with and without the patch?
Created attachment 138205 [details] Shader dump without any changes
Created attachment 138206 [details] Shader dump with the patch
Hi, Jason. Dumps are attached. Can you please take a look?
In the two dumps you provided, there are quite a few differences. It would help if we could narrow it down a bit and figure out which vertex shader is causing the problem. Hopefully, it's one of the small ones. :-)

I recommend putting a little check in try_coalesce in nir_lower_vec_to_movs.c that looks at shader->info.name and returns early for one shader at a time, such as:

   if (strcmp(shader->info.name, "GLSL8") == 0)
      return 0;

That way we can narrow it down to the exact shader that's causing problems. From there, it's sometimes useful to narrow it down further with something like:

   if (strcmp(shader->info.name, "GLSL8") == 0) {
      static int i = 0;
      i++;
      if (i > 8 && i < 20)
         return 0;
   }

until you get down to the smallest change you can manage which still reproduces the bug.
(In reply to Jason Ekstrand from comment #8)
> In the two dumps you provided, there are quite a few differences. It would
> help if we could narrow it down a bit and figure out which vertex shader is
> causing the problem. Hopefully, it's one of the small ones. :-)
>
> I recommend putting a little check in try_coalesce in
> nir_lower_vec_to_movs.c that looks at shader->info.name and returns early
> for one shader at a time, such as:
>
>    if (strcmp(shader->info.name, "GLSL8") == 0)
>       return 0;
>
> That way we can narrow it down to the exact shader that's causing problems.
> From there, it's sometimes useful to narrow it down further with something
> like:
>
>    if (strcmp(shader->info.name, "GLSL8") == 0) {
>       static int i = 0;
>       i++;
>       if (i > 8 && i < 20)
>          return 0;
>    }
>
> until you get down to the smallest change you can manage which still
> reproduces the bug.

Thanks for the hint! Looks like the issue is in the GLSL18 shader.
Created attachment 138307 [details] [review] Proposed patch
Found out that the issue is present in the GLSL18 shader and is related to the mov coalescing optimization for the fadd instruction. Disabling the optimization when the source and destination are the same register fixes the issue.
This is fixed with the following commit in master:

commit 800df942eadc5356840f5cbc2ceaa8a65c01ee91
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Mar 23 11:05:04 2018 -0700

    nir/lower_vec_to_movs: Only coalesce if the vec had a SSA destination

    Otherwise we may end up trying to coalesce in a case such as

    ssa_1 = fadd r1, r2
    r3.x = fneg(r2);
    r3 = vec4(ssa_1, ssa_1.y, ...)

    and that would cause us to move the writes to r3 from the vec to the
    fadd which would re-order them with respect to the write from the fneg.
    In order to solve this, we just don't coalesce if the destination of the
    vec is not SSA.  We could try to get clever and still coalesce if there
    are no writes to the destination of the vec between the vec and the ALU
    source.  However, since registers only come from phi webs and indirects,
    the chances of having a vec with a register destination that is actually
    coalescable into its source is very slim.

    Shader-db results on Haswell:

    total instructions in shared programs: 13657906 -> 13659101 (<.01%)
    instructions in affected programs: 149291 -> 150486 (0.80%)
    helped: 0
    HURT: 592

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105440
    Fixes: 2458ea95c56 "nir/lower_vec_to_movs: Coalesce movs on-the-fly when possible"
    Reported-by: Vadym Shovkoplias <vadym.shovkoplias@globallogic.com>
    Tested-by: Vadym Shovkoplias <vadym.shovkoplias@globallogic.com>
    Reviewed-by: Matt Turner <mattst88@gmail.com>