Bug 78161

Summary: [NV96] Artifacts in output of fragment program containing non-unrolled loops with a conditional break
Product: Mesa
Reporter: Grzegorz Wójcik <gzregozrw>
Component: Drivers/DRI/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments: apitrace of simple test program.
dmesg
artifacts
output from llvmpipe

Description Grzegorz Wójcik 2014-05-01 15:29:05 UTC
Created attachment 98301 [details]
apitrace of simple test program.

The problem appears in shaders containing loops with a conditional break when the loop is not unrolled by nouveau. I checked the emitted code with ST_DEBUG=tgsi NV50_PROG_DEBUG=1.
Tested with Mesa 10.2.0-devel (git-475f5ff) from 2014-04-30.
This bug has existed for more than a year; I first saw it in 2013-02.
I think the artifacts appear in areas where there is a big difference in the number of iterations each pixel executes before breaking out of the loop.
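
To probe that hypothesis on the CPU, here is a minimal Python sketch that transcribes the loop from the TGSI dump in comment 4 and reports the loop counter at which each pixel exits. Several things are assumptions made for illustration only: the names `shade` and `exit_counts`, treating IN[0].xy as coordinates stepped by 1/256, and passing CONST[0].x as `limit`. The arithmetic runs in double precision rather than the GPU's float32, so it will not be bit-exact with either nouveau or llvmpipe.

```python
import math


def shade(x, y, limit, max_iter=200.0):
    """Return (rgba, i) for one pixel; i is the loop counter at exit."""
    color = [1.0, 1.0, 1.0, 1.0]      # MOV TEMP[0], IMM[0].xxxx
    i = 0.0                           # MOV TEMP[2].x, IMM[0].yyyy
    while True:
        if i >= max_iter:             # FSGE + UIF + BRK
            break
        if limit < i:                 # FSLT CONST[0].x < i, UIF + BRK
            break
        # MUL, ADD, MUL, MAD: y * 234895.7188 + (x * 231) * (i + 33)
        t = y * 234895.7188 + (x * 231.0) * (i + 33.0)
        t *= i + 1.1                  # ADD, MUL
        if 0.98 < math.cos(t):        # COS, FSLT, UIF
            color = [0.0, 0.0, i * 0.005, 1.0]
            break                     # the third conditional BRK
        i += 1.0
    color[0] += 0.1                   # ADD TEMP[0].x after ENDLOOP
    return color, int(i)


def exit_counts(y, limit, width=8, step=1.0 / 256.0):
    """Exit counters for `width` horizontally adjacent pixels (assumed spacing)."""
    return [shade(px * step, y, limit)[1] for px in range(width)]
```

Calling `print(exit_counts(0.5, 200.0))` lists the exit counters of eight adjacent pixels, which gives a rough idea of how strongly neighbouring pixels can diverge in this shader.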
Comment 1 Grzegorz Wójcik 2014-05-01 15:29:57 UTC
Created attachment 98302 [details]
dmesg
Comment 2 Grzegorz Wójcik 2014-05-01 15:31:27 UTC
Created attachment 98303 [details]
artifacts
Comment 3 Grzegorz Wójcik 2014-05-01 15:32:50 UTC
Created attachment 98304 [details]
output from llvmpipe
Comment 4 Ilia Mirkin 2014-05-02 01:27:48 UTC
The relevant TGSI shader:

FRAG
PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
DCL IN[0], GENERIC[20], PERSPECTIVE
DCL OUT[0], COLOR
DCL CONST[0]
DCL TEMP[0..13], LOCAL
IMM[0] FLT32 {    1.0000,     0.0000,   200.0000, 234895.7188}
IMM[1] FLT32 {  231.0000,    33.0000,     1.1000,     0.9800}
IMM[2] FLT32 {    0.0050,     0.1000,     0.0000,     0.0000}
  0: MOV TEMP[0], IMM[0].xxxx
  1: MOV TEMP[1].xy, IN[0].xyxx
  2: MOV TEMP[2].x, IMM[0].yyyy
  3: BGNLOOP :0
  4:   FSGE TEMP[3].x, TEMP[2].xxxx, IMM[0].zzzz
  5:   UIF TEMP[3].xxxx :0
  6:     BRK
  7:   ENDIF
  8:   FSLT TEMP[4].x, CONST[0].xxxx, TEMP[2].xxxx
  9:   UIF TEMP[4].xxxx :0
 10:     BRK
 11:   ENDIF
 12:   MUL TEMP[5].x, TEMP[1].xxxx, IMM[1].xxxx
 13:   ADD TEMP[6].x, TEMP[2].xxxx, IMM[1].yyyy
 14:   MUL TEMP[7].x, TEMP[5].xxxx, TEMP[6].xxxx
 15:   MAD TEMP[8].x, TEMP[1].yyyy, IMM[0].wwww, TEMP[7].xxxx
 16:   ADD TEMP[9].x, TEMP[2].xxxx, IMM[1].zzzz
 17:   MUL TEMP[10].x, TEMP[8].xxxx, TEMP[9].xxxx
 18:   COS TEMP[11].x, TEMP[10].xxxx
 19:   FSLT TEMP[12].x, IMM[1].wwww, TEMP[11].xxxx
 20:   UIF TEMP[12].xxxx :0
 21:     MOV TEMP[0].xyw, IMM[0].yyyx
 22:     MUL TEMP[13].x, TEMP[2].xxxx, IMM[2].xxxx
 23:     MOV TEMP[0].z, TEMP[13].xxxx
 24:     BRK
 25:   ENDIF
 26:   ADD TEMP[2].x, TEMP[2].xxxx, IMM[0].xxxx
 27: ENDLOOP :0
 28: ADD TEMP[0].x, TEMP[0].xxxx, IMM[2].yyyy
 29: MOV OUT[0], TEMP[0]
 30: END

and the envydis output of the generated shader (on my G96, with ~mesa-git):

00000000: 80000000              interp $r0 v[0x0]
00000004: 90000000              rcp f32 $r0 $r0
00000008: 1000800d 03f80003     mov b32 $r3 0x3f800000
00000010: 82010004              interp $r1 v[0x4] $r0
00000014: 82020000              interp $r0 v[0x8] $r0
00000018: 10008009 00000003     mov b32 $r2 0x0
00000020: 40024003 00000000     breakaddr 0x120
00000028: 10008011 04348003   B mov b32 $r4 0x43480000
00000030: b00405fd 600187c8     set $c0 # ge f32 $r2 $r4
00000038: 10000601 0403c280     (lg $c0) mov b32 $r0 $r3
00000040: 10000605 0403c280     (lg $c0) mov b32 $r1 $r3
00000048: 10000609 0403c280     (lg $c0) mov b32 $r2 $r3
00000050: 50000003 00000280     (lg $c0) break
00000058: 10000011 2400c780     ld $r4 b32 c0[0x0]
00000060: b00405fd 600107c8     set $c0 # g f32 $r2 $r4
00000068: 10000601 0403c280     (lg $c0) mov b32 $r0 $r3
00000070: 10000605 0403c280     (lg $c0) mov b32 $r1 $r3
00000078: 10000609 0403c280     (lg $c0) mov b32 $r2 $r3
00000080: 50000003 00000280     (lg $c0) break
00000088: c0000211 04367003     mul f32 $r4 $r1 0x43670000
00000090: b0000415 04204003     add f32 $r5 $r2 0x42040000
00000098: c0050811 00000780     mul rn f32 $r4 $r4 $r5
000000a0: 102e8015 0486563f     mov b32 $r5 0x486563ee
000000a8: e0050011 00010780     add f32 $r4 (mul $r0 $r5) $r4
000000b0: b00d0415 03f8cccf     add f32 $r5 $r2 0x3f8ccccd
000000b8: c0050811 00000780     mul rn f32 $r4 $r4 $r5
000000c0: b0000811 c0000780     presin f32 $r4 $r4
000000c8: 90000811 a0000780     cos f32 $r4 $r4
000000d0: 10088015 03f7ae17     mov b32 $r5 0x3f7ae148
000000d8: b00509fd 600107c8     set $c0 # g f32 $r4 $r5
000000e0: 10022003 00000100     (e $c0) bra 0x110
000000e8: 10008005 00000003     mov b32 $r1 0x0
000000f0: 1000800d 03f80003     mov b32 $r3 0x3f800000
000000f8: c00a0409 03ba3d73     mul f32 $r2 $r2 0x3ba3d70a
00000100: 10000201 0403c780     mov b32 $r0 $r1
00000108: 50000003 00000780     break
00000110: b0000409 03f80003   B add f32 $r2 $r2 0x3f800000
00000118: 10005003 00000780     bra 0x28
00000120: b00d0001 03dccccf   B add f32 $r0 $r0 0x3dcccccd
00000128: f0000001 e0000001     exit (never) nop
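
When matching the envydis listing against the TGSI dump, it helps to know that the hex operands are the IEEE 754 single-precision bit patterns of the IMM[] values. A small Python sketch of the decoding follows; the helper name `f32_from_bits` is made up for this example, and the table only pairs constants that appear in both listings above.

```python
import struct


def f32_from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# envydis operand -> value from the TGSI IMM[] declarations
IMMEDIATES = {
    0x3f800000: 1.0,          # IMM[0].x
    0x43480000: 200.0,        # IMM[0].z
    0x486563ee: 234895.7188,  # IMM[0].w
    0x43670000: 231.0,        # IMM[1].x
    0x42040000: 33.0,         # IMM[1].y
    0x3f8ccccd: 1.1,          # IMM[1].z
    0x3f7ae148: 0.98,         # IMM[1].w
    0x3ba3d70a: 0.005,        # IMM[2].x
    0x3dcccccd: 0.1,          # IMM[2].y
}

for bits, expected in IMMEDIATES.items():
    print("0x%08x -> %r (TGSI: %r)" % (bits, f32_from_bits(bits), expected))
```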
Comment 5 Ilia Mirkin 2014-05-02 01:44:32 UTC
Forcing break to not have prefixes doesn't fix things, btw, but _does_ improve fps by like 10%+. Weird. Perhaps the prefixes aren't such a great win beyond a certain number of instructions. Visually, the results are at least very similar, if not identical.

This is the shader if OP_BREAK is added to the noPredList:

00000000: 80000000              interp $r0 v[0x0]
00000004: 90000000              rcp f32 $r0 $r0
00000008: 1000800d 03f80003     mov b32 $r3 0x3f800000
00000010: 82010004              interp $r1 v[0x4] $r0
00000014: 82020000              interp $r0 v[0x8] $r0
00000018: 10008009 00000003     mov b32 $r2 0x0
00000020: 40024003 00000000     breakaddr 0x120
00000028: 10008011 04348003   B mov b32 $r4 0x43480000
00000030: b00405fd 600187c8     set $c0 # ge f32 $r2 $r4
00000038: 1000b003 00000100     (e $c0) bra 0x58
00000040: 10008600              mov b32 $r0 $r3
00000044: 10008604              mov b32 $r1 $r3
00000048: 10000609 0403c780     mov b32 $r2 $r3
00000050: 50000003 00000780     break
00000058: 10000011 2400c780   B ld $r4 b32 c0[0x0]
00000060: b00405fd 600107c8     set $c0 # g f32 $r2 $r4
00000068: 10011003 00000100     (e $c0) bra 0x88
00000070: 10008600              mov b32 $r0 $r3
00000074: 10008604              mov b32 $r1 $r3
00000078: 10000609 0403c780     mov b32 $r2 $r3
00000080: 50000003 00000780     break
00000088: c0000211 04367003   B mul f32 $r4 $r1 0x43670000
00000090: b0000415 04204003     add f32 $r5 $r2 0x42040000
00000098: c0050811 00000780     mul rn f32 $r4 $r4 $r5
000000a0: 102e8015 0486563f     mov b32 $r5 0x486563ee
000000a8: e0050011 00010780     add f32 $r4 (mul $r0 $r5) $r4
000000b0: b00d0415 03f8cccf     add f32 $r5 $r2 0x3f8ccccd
000000b8: c0050811 00000780     mul rn f32 $r4 $r4 $r5
000000c0: b0000811 c0000780     presin f32 $r4 $r4
000000c8: 90000811 a0000780     cos f32 $r4 $r4
000000d0: 10088015 03f7ae17     mov b32 $r5 0x3f7ae148
000000d8: b00509fd 600107c8     set $c0 # g f32 $r4 $r5
000000e0: 10022003 00000100     (e $c0) bra 0x110
000000e8: 10008005 00000003     mov b32 $r1 0x0
000000f0: 1000800d 03f80003     mov b32 $r3 0x3f800000
000000f8: c00a0409 03ba3d73     mul f32 $r2 $r2 0x3ba3d70a
00000100: 10000201 0403c780     mov b32 $r0 $r1
00000108: 50000003 00000780     break
00000110: b0000409 03f80003   B add f32 $r2 $r2 0x3f800000
00000118: 10005003 00000780     bra 0x28
00000120: b00d0001 03dccccf   B add f32 $r0 $r0 0x3dcccccd
00000128: f0000001 e0000001     exit (never) nop

Next step is to see what the blob compiler does with this.
Comment 6 Ilia Mirkin 2014-05-02 02:01:04 UTC
Fail. Their compiler is a lot smarter than ours. (Note that I'm pretty sure their code starts at 0x100, which is why the branch destinations are all off.)

00000000: 10008005 03f80003     mov b32 $r1 0x3f800000
00000008: 8000000c              interp $r3 v[0x0]
0000000c: 1000fe10              mov b32 $r4 $r63
00000010: 10008208              mov b32 $r2 $r1
00000014: 10008200              mov b32 $r0 $r1
00000018: 90000615 00000780     rcp f32 $r5 $r3
00000020: b08009fd 604107c8     set $c0 # g f32 $r4 c1[0x0]
00000028: 10039003 00000680     (lgu $c0) bra 0x1c8
00000030: b000080d 04204003     add f32 $r3 $r4 0x42040000
00000038: 82010a18              interp $r6 v[0x4] $r5
0000003c: 82020a1c              interp $r7 v[0x8] $r5
00000040: c0060621 00000780     mul rn f32 $r8 $r3 $r6
00000048: c02e0e0d 0486563f     mul f32 $r3 $r7 0x486563ee
00000050: b00d0819 03f8cccf     add f32 $r6 $r4 0x3f8ccccd
00000058: e000100d 04367003     add f32 $r3 (mul $r8 0x43670000) $r3
00000060: c0030c0d 00000780     mul rn f32 $r3 $r6 $r3
00000068: b000060d c0000780     presin f32 $r3 $r3
00000070: 9000060d a0000780     cos f32 $r3 $r3
00000078: b08107fd 60c107c8     set $c0 # g f32 $r3 c3[0x4]
00000080: c0830809 00c00680     (lgu $c0) mul rn f32 $r2 $r4 c3[0xc]
00000088: 10007e05 0403c680     (lgu $c0) mov b32 $r1 $r63
00000090: 10007e01 0403c680     (lgu $c0) mov b32 $r0 $r63
00000098: 10039003 00000680     (lgu $c0) bra 0x1c8
000000a0: b0000811 03f80003     add f32 $r4 $r4 0x3f800000
000000a8: b082080d 60c04780     set $r3 l f32 $r4 c3[0x8]
000000b0: a000060d 04114780     cvt abs u32 $r3 s32 $r3
000000b8: 303f07fd 640087c8     set $c0 # e u32 $r3 $r63
000000c0: 10024003 00000700     (geu $c0) bra 0x120
000000c8: b00d0001 03dccccf     add f32 $r0 $r0 0x3dcccccd
000000d0: 1000080d 24c0c781     exit ld $r3 b32 c3[0x10]

So... I guess the next move is to write a horrendously twisted glsl program that even their compiler can't avoid breaks on, and see what happens.
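
To read the blob listing more easily, the branch targets can be rebased under the assumption stated above that the code was loaded at 0x100. This is only a Python sketch: `rebase_branches` and `ASSUMED_BASE` are names invented here, and the 0x100 base is the guess from this comment, not a confirmed fact.

```python
import re

ASSUMED_BASE = 0x100  # guess from the comment above, not verified


def rebase_branches(listing, base=ASSUMED_BASE):
    """Rewrite 'bra 0x...' targets so they are relative to the listing start."""
    def fix(match):
        return "bra 0x%x" % (int(match.group(1), 16) - base)
    return re.sub(r"\bbra 0x([0-9a-f]+)", fix, listing)
```

With that base, `bra 0x1c8` becomes `bra 0xc8` (the final add before exit) and `bra 0x120` becomes `bra 0x20` (the `set $c0` at the loop head), both of which are instruction boundaries in the listing, which is consistent with the guess.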
Comment 7 Ilia Mirkin 2014-05-02 02:40:50 UTC
And for the record, getting rid of the prebreak/break thing and replacing it with bra's doesn't fix things either:

--- a/src/gallium/drivers/nouveau/codegen/nv50_ir_lowering_nv50.cpp
+++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_lowering_nv50.cpp
@@ -1266,8 +1266,10 @@ NV50LoweringPreSSA::visit(Instruction *i)
       return handleWRSV(i);
    case OP_CALL:
       return handleCALL(i);
+   case OP_PREBREAK:
    case OP_PRECONT:
       return handlePRECONT(i);
+   case OP_BREAK:
    case OP_CONT:
       return handleCONT(i);
    case OP_PFETCH:

results in identical-looking results (but slower), generated by:

00000000: 80000000              interp $r0 v[0x0]
00000004: 90000000              rcp f32 $r0 $r0
00000008: 1000800d 03f80003     mov b32 $r3 0x3f800000
00000010: 82010004              interp $r1 v[0x4] $r0
00000014: 82020000              interp $r0 v[0x8] $r0
00000018: 10008009 00000003     mov b32 $r2 0x0
00000020: 10008011 04348003   B mov b32 $r4 0x43480000
00000028: b00405fd 600187c8     set $c0 # ge f32 $r2 $r4
00000030: 10000601 0403c280     (lg $c0) mov b32 $r0 $r3
00000038: 10000605 0403c280     (lg $c0) mov b32 $r1 $r3
00000040: 10000609 0403c280     (lg $c0) mov b32 $r2 $r3
00000048: 10023003 00000280     (lg $c0) bra 0x118
00000050: 10000011 2400c780     ld $r4 b32 c0[0x0]
00000058: b00405fd 600107c8     set $c0 # g f32 $r2 $r4
00000060: 10000601 0403c280     (lg $c0) mov b32 $r0 $r3
00000068: 10000605 0403c280     (lg $c0) mov b32 $r1 $r3
00000070: 10000609 0403c280     (lg $c0) mov b32 $r2 $r3
00000078: 10023003 00000280     (lg $c0) bra 0x118
00000080: c0000211 04367003     mul f32 $r4 $r1 0x43670000
00000088: b0000415 04204003     add f32 $r5 $r2 0x42040000
00000090: c0050811 00000780     mul rn f32 $r4 $r4 $r5
00000098: 102e8015 0486563f     mov b32 $r5 0x486563ee
000000a0: e0050011 00010780     add f32 $r4 (mul $r0 $r5) $r4
000000a8: b00d0415 03f8cccf     add f32 $r5 $r2 0x3f8ccccd
000000b0: c0050811 00000780     mul rn f32 $r4 $r4 $r5
000000b8: b0000811 c0000780     presin f32 $r4 $r4
000000c0: 90000811 a0000780     cos f32 $r4 $r4
000000c8: 10088015 03f7ae17     mov b32 $r5 0x3f7ae148
000000d0: b00509fd 600107c8     set $c0 # g f32 $r4 $r5
000000d8: 10021003 00000100     (e $c0) bra 0x108
000000e0: 10008005 00000003     mov b32 $r1 0x0
000000e8: 1000800d 03f80003     mov b32 $r3 0x3f800000
000000f0: c00a0409 03ba3d73     mul f32 $r2 $r2 0x3ba3d70a
000000f8: 10000201 0403c780     mov b32 $r0 $r1
00000100: 10023003 00000780     bra 0x118
00000108: b0000409 03f80003   B add f32 $r2 $r2 0x3f800000
00000110: 10004003 00000780     bra 0x20
00000118: b00d0001 03dccccf   B add f32 $r0 $r0 0x3dcccccd
00000120: f0000001 e0000001     exit (never) nop

I guess it's time to look at this a little more carefully.
Comment 8 Ilia Mirkin 2015-03-20 14:04:23 UTC
Can one of the people with the issue try running an affected program with

NOUVEAU_SHADER_WATCHDOG=false

in the environment?
Comment 9 Ilia Mirkin 2015-03-20 14:09:13 UTC
and/or changing the 0x18 value written to WATCHDOG_TIMER in nv50_screen.c to 0x1e.
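
For anyone who wants to script the first test, a minimal Python sketch of launching a program with the variable set is shown below. The helper names are hypothetical, and the `glretrace` invocation in the note afterwards assumes apitrace is installed and that the attached trace was saved locally under a name of your choosing.

```python
import os
import subprocess


def watchdog_disabled_env(base_env=None):
    """Copy the environment and set NOUVEAU_SHADER_WATCHDOG=false in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["NOUVEAU_SHADER_WATCHDOG"] = "false"
    return env


def run_without_watchdog(argv):
    """Run argv with the variable set and return the process exit code."""
    return subprocess.run(argv, env=watchdog_disabled_env()).returncode
```

For example, `run_without_watchdog(["glretrace", "test.trace"])` would replay a trace, where `test.trace` is a placeholder file name.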
Comment 10 Nick Tenney 2015-03-28 02:18:09 UTC
For the record (discussed on IRC), this still renders improperly with the changes proposed above.
Comment 11 GitLab Migration User 2019-09-18 20:39:36 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1064.
