Summary: | [NV96] Artifacts in output of fragment program containing not unrolled loops with conditional break | ||
---|---|---|---|
Product: | Mesa | Reporter: | Grzegorz Wójcik <gzregozrw> |
Component: | Drivers/DRI/nouveau | Assignee: | Nouveau Project <nouveau> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
apitrace of simple test program.
dmesg
artifacts
output from llvmpipe |
Created attachment 98302 [details]
dmesg
Created attachment 98303 [details]
artifacts
Created attachment 98304 [details]
output from llvmpipe
The relevant TGSI shader:

FRAG
PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
DCL IN[0], GENERIC[20], PERSPECTIVE
DCL OUT[0], COLOR
DCL CONST[0]
DCL TEMP[0..13], LOCAL
IMM[0] FLT32 { 1.0000, 0.0000, 200.0000, 234895.7188}
IMM[1] FLT32 { 231.0000, 33.0000, 1.1000, 0.9800}
IMM[2] FLT32 { 0.0050, 0.1000, 0.0000, 0.0000}
0: MOV TEMP[0], IMM[0].xxxx
1: MOV TEMP[1].xy, IN[0].xyxx
2: MOV TEMP[2].x, IMM[0].yyyy
3: BGNLOOP :0
4: FSGE TEMP[3].x, TEMP[2].xxxx, IMM[0].zzzz
5: UIF TEMP[3].xxxx :0
6: BRK
7: ENDIF
8: FSLT TEMP[4].x, CONST[0].xxxx, TEMP[2].xxxx
9: UIF TEMP[4].xxxx :0
10: BRK
11: ENDIF
12: MUL TEMP[5].x, TEMP[1].xxxx, IMM[1].xxxx
13: ADD TEMP[6].x, TEMP[2].xxxx, IMM[1].yyyy
14: MUL TEMP[7].x, TEMP[5].xxxx, TEMP[6].xxxx
15: MAD TEMP[8].x, TEMP[1].yyyy, IMM[0].wwww, TEMP[7].xxxx
16: ADD TEMP[9].x, TEMP[2].xxxx, IMM[1].zzzz
17: MUL TEMP[10].x, TEMP[8].xxxx, TEMP[9].xxxx
18: COS TEMP[11].x, TEMP[10].xxxx
19: FSLT TEMP[12].x, IMM[1].wwww, TEMP[11].xxxx
20: UIF TEMP[12].xxxx :0
21: MOV TEMP[0].xyw, IMM[0].yyyx
22: MUL TEMP[13].x, TEMP[2].xxxx, IMM[2].xxxx
23: MOV TEMP[0].z, TEMP[13].xxxx
24: BRK
25: ENDIF
26: ADD TEMP[2].x, TEMP[2].xxxx, IMM[0].xxxx
27: ENDLOOP :0
28: ADD TEMP[0].x, TEMP[0].xxxx, IMM[2].yyyy
29: MOV OUT[0], TEMP[0]
30: END

and the envydis output of the generated shader (on my G96, with ~mesa-git):

00000000: 80000000 interp $r0 v[0x0]
00000004: 90000000 rcp f32 $r0 $r0
00000008: 1000800d 03f80003 mov b32 $r3 0x3f800000
00000010: 82010004 interp $r1 v[0x4] $r0
00000014: 82020000 interp $r0 v[0x8] $r0
00000018: 10008009 00000003 mov b32 $r2 0x0
00000020: 40024003 00000000 breakaddr 0x120
00000028: 10008011 04348003 B mov b32 $r4 0x43480000
00000030: b00405fd 600187c8 set $c0 # ge f32 $r2 $r4
00000038: 10000601 0403c280 (lg $c0) mov b32 $r0 $r3
00000040: 10000605 0403c280 (lg $c0) mov b32 $r1 $r3
00000048: 10000609 0403c280 (lg $c0) mov b32 $r2 $r3
00000050: 50000003 00000280 (lg $c0) break
00000058: 10000011 2400c780 ld $r4 b32 c0[0x0]
00000060: b00405fd 600107c8 set $c0 # g f32 $r2 $r4
00000068: 10000601 0403c280 (lg $c0) mov b32 $r0 $r3
00000070: 10000605 0403c280 (lg $c0) mov b32 $r1 $r3
00000078: 10000609 0403c280 (lg $c0) mov b32 $r2 $r3
00000080: 50000003 00000280 (lg $c0) break
00000088: c0000211 04367003 mul f32 $r4 $r1 0x43670000
00000090: b0000415 04204003 add f32 $r5 $r2 0x42040000
00000098: c0050811 00000780 mul rn f32 $r4 $r4 $r5
000000a0: 102e8015 0486563f mov b32 $r5 0x486563ee
000000a8: e0050011 00010780 add f32 $r4 (mul $r0 $r5) $r4
000000b0: b00d0415 03f8cccf add f32 $r5 $r2 0x3f8ccccd
000000b8: c0050811 00000780 mul rn f32 $r4 $r4 $r5
000000c0: b0000811 c0000780 presin f32 $r4 $r4
000000c8: 90000811 a0000780 cos f32 $r4 $r4
000000d0: 10088015 03f7ae17 mov b32 $r5 0x3f7ae148
000000d8: b00509fd 600107c8 set $c0 # g f32 $r4 $r5
000000e0: 10022003 00000100 (e $c0) bra 0x110
000000e8: 10008005 00000003 mov b32 $r1 0x0
000000f0: 1000800d 03f80003 mov b32 $r3 0x3f800000
000000f8: c00a0409 03ba3d73 mul f32 $r2 $r2 0x3ba3d70a
00000100: 10000201 0403c780 mov b32 $r0 $r1
00000108: 50000003 00000780 break
00000110: b0000409 03f80003 B add f32 $r2 $r2 0x3f800000
00000118: 10005003 00000780 bra 0x28
00000120: b00d0001 03dccccf B add f32 $r0 $r0 0x3dcccccd
00000128: f0000001 e0000001 exit (never) nop

Forcing break to not have prefixes doesn't fix things, btw, but _does_ improve fps by like 10%+. Weird. Perhaps the prefixes aren't such a great win beyond a certain number of instructions. Visually, the results are at least very similar, if not identical.
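For anyone cross-checking the envydis dump against the TGSI: the 32-bit immediates printed by envydis appear to be the IMM[] floats in IEEE-754 single-precision form. A minimal Python sketch to decode them (this is not part of Mesa or envytools; `f32_from_bits` is a helper name made up for illustration):

```python
import struct

def f32_from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Immediates copied from the envydis output of the generated shader.
immediates = [
    0x3f800000, 0x43480000, 0x486563ee,  # compare with IMM[0] x, z, w
    0x43670000, 0x42040000, 0x3f8ccccd, 0x3f7ae148,  # compare with IMM[1]
    0x3ba3d70a, 0x3dcccccd,  # compare with IMM[2] x, y
]

for bits in immediates:
    print(f"0x{bits:08x} -> {f32_from_bits(bits):.4f}")
```

The decoded values line up with the TGSI immediates (200.0 for the iteration cap, 0.98 for the cosine threshold, and so on), so the constants themselves look correctly translated.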
This is the shader if OP_BREAK is added to the noPredList:

00000000: 80000000 interp $r0 v[0x0]
00000004: 90000000 rcp f32 $r0 $r0
00000008: 1000800d 03f80003 mov b32 $r3 0x3f800000
00000010: 82010004 interp $r1 v[0x4] $r0
00000014: 82020000 interp $r0 v[0x8] $r0
00000018: 10008009 00000003 mov b32 $r2 0x0
00000020: 40024003 00000000 breakaddr 0x120
00000028: 10008011 04348003 B mov b32 $r4 0x43480000
00000030: b00405fd 600187c8 set $c0 # ge f32 $r2 $r4
00000038: 1000b003 00000100 (e $c0) bra 0x58
00000040: 10008600 mov b32 $r0 $r3
00000044: 10008604 mov b32 $r1 $r3
00000048: 10000609 0403c780 mov b32 $r2 $r3
00000050: 50000003 00000780 break
00000058: 10000011 2400c780 B ld $r4 b32 c0[0x0]
00000060: b00405fd 600107c8 set $c0 # g f32 $r2 $r4
00000068: 10011003 00000100 (e $c0) bra 0x88
00000070: 10008600 mov b32 $r0 $r3
00000074: 10008604 mov b32 $r1 $r3
00000078: 10000609 0403c780 mov b32 $r2 $r3
00000080: 50000003 00000780 break
00000088: c0000211 04367003 B mul f32 $r4 $r1 0x43670000
00000090: b0000415 04204003 add f32 $r5 $r2 0x42040000
00000098: c0050811 00000780 mul rn f32 $r4 $r4 $r5
000000a0: 102e8015 0486563f mov b32 $r5 0x486563ee
000000a8: e0050011 00010780 add f32 $r4 (mul $r0 $r5) $r4
000000b0: b00d0415 03f8cccf add f32 $r5 $r2 0x3f8ccccd
000000b8: c0050811 00000780 mul rn f32 $r4 $r4 $r5
000000c0: b0000811 c0000780 presin f32 $r4 $r4
000000c8: 90000811 a0000780 cos f32 $r4 $r4
000000d0: 10088015 03f7ae17 mov b32 $r5 0x3f7ae148
000000d8: b00509fd 600107c8 set $c0 # g f32 $r4 $r5
000000e0: 10022003 00000100 (e $c0) bra 0x110
000000e8: 10008005 00000003 mov b32 $r1 0x0
000000f0: 1000800d 03f80003 mov b32 $r3 0x3f800000
000000f8: c00a0409 03ba3d73 mul f32 $r2 $r2 0x3ba3d70a
00000100: 10000201 0403c780 mov b32 $r0 $r1
00000108: 50000003 00000780 break
00000110: b0000409 03f80003 B add f32 $r2 $r2 0x3f800000
00000118: 10005003 00000780 bra 0x28
00000120: b00d0001 03dccccf B add f32 $r0 $r0 0x3dcccccd
00000128: f0000001 e0000001 exit (never) nop

Next step is to see what the blob compiler does with this.

Fail. Their compiler is a lot smarter than ours. (Note that I'm pretty sure their code starts at 0x100, which is why the branch destinations are all off.)

00000000: 10008005 03f80003 mov b32 $r1 0x3f800000
00000008: 8000000c interp $r3 v[0x0]
0000000c: 1000fe10 mov b32 $r4 $r63
00000010: 10008208 mov b32 $r2 $r1
00000014: 10008200 mov b32 $r0 $r1
00000018: 90000615 00000780 rcp f32 $r5 $r3
00000020: b08009fd 604107c8 set $c0 # g f32 $r4 c1[0x0]
00000028: 10039003 00000680 (lgu $c0) bra 0x1c8
00000030: b000080d 04204003 add f32 $r3 $r4 0x42040000
00000038: 82010a18 interp $r6 v[0x4] $r5
0000003c: 82020a1c interp $r7 v[0x8] $r5
00000040: c0060621 00000780 mul rn f32 $r8 $r3 $r6
00000048: c02e0e0d 0486563f mul f32 $r3 $r7 0x486563ee
00000050: b00d0819 03f8cccf add f32 $r6 $r4 0x3f8ccccd
00000058: e000100d 04367003 add f32 $r3 (mul $r8 0x43670000) $r3
00000060: c0030c0d 00000780 mul rn f32 $r3 $r6 $r3
00000068: b000060d c0000780 presin f32 $r3 $r3
00000070: 9000060d a0000780 cos f32 $r3 $r3
00000078: b08107fd 60c107c8 set $c0 # g f32 $r3 c3[0x4]
00000080: c0830809 00c00680 (lgu $c0) mul rn f32 $r2 $r4 c3[0xc]
00000088: 10007e05 0403c680 (lgu $c0) mov b32 $r1 $r63
00000090: 10007e01 0403c680 (lgu $c0) mov b32 $r0 $r63
00000098: 10039003 00000680 (lgu $c0) bra 0x1c8
000000a0: b0000811 03f80003 add f32 $r4 $r4 0x3f800000
000000a8: b082080d 60c04780 set $r3 l f32 $r4 c3[0x8]
000000b0: a000060d 04114780 cvt abs u32 $r3 s32 $r3
000000b8: 303f07fd 640087c8 set $c0 # e u32 $r3 $r63
000000c0: 10024003 00000700 (geu $c0) bra 0x120
000000c8: b00d0001 03dccccf add f32 $r0 $r0 0x3dcccccd
000000d0: 1000080d 24c0c781 exit ld $r3 b32 c3[0x10]

So... I guess the next move is to write a horrendously twisted glsl program that even their compiler can't avoid breaks on, and see what happens.
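To make the blob dump easier to follow, the branch targets can be rebased under the assumption stated above, namely that the blob's code is loaded at 0x100. A small Python sketch of that arithmetic (the `rebase` helper and the 0x100 base are assumptions for illustration, not something confirmed by the dump itself):

```python
def rebase(target, load_base=0x100):
    """Map an absolute branch target to an offset within the dumped code,
    assuming the code was loaded at load_base."""
    return target - load_base

# Branch targets as they appear in the blob's disassembly.
blob_targets = [0x1c8, 0x120]

print([hex(rebase(t)) for t in blob_targets])  # ['0xc8', '0x20']
```

Under that assumption, 0x1c8 lands on the final `add f32 $r0 $r0 0x3dcccccd` at 0xc8 (the loop exit) and 0x120 lands on the `set $c0 # g f32 $r4 c1[0x0]` at 0x20 (the loop head), which is consistent with the guess about the load address.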
And for the record, getting rid of the prebreak/break thing and replacing it with bra's doesn't fix things either:

--- a/src/gallium/drivers/nouveau/codegen/nv50_ir_lowering_nv50.cpp
+++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_lowering_nv50.cpp
@@ -1266,8 +1266,10 @@ NV50LoweringPreSSA::visit(Instruction *i)
       return handleWRSV(i);
    case OP_CALL:
       return handleCALL(i);
+   case OP_PREBREAK:
    case OP_PRECONT:
       return handlePRECONT(i);
+   case OP_BREAK:
    case OP_CONT:
       return handleCONT(i);
    case OP_PFETCH:

results in identical-looking results (but slower), generated by:

00000000: 80000000 interp $r0 v[0x0]
00000004: 90000000 rcp f32 $r0 $r0
00000008: 1000800d 03f80003 mov b32 $r3 0x3f800000
00000010: 82010004 interp $r1 v[0x4] $r0
00000014: 82020000 interp $r0 v[0x8] $r0
00000018: 10008009 00000003 mov b32 $r2 0x0
00000020: 10008011 04348003 B mov b32 $r4 0x43480000
00000028: b00405fd 600187c8 set $c0 # ge f32 $r2 $r4
00000030: 10000601 0403c280 (lg $c0) mov b32 $r0 $r3
00000038: 10000605 0403c280 (lg $c0) mov b32 $r1 $r3
00000040: 10000609 0403c280 (lg $c0) mov b32 $r2 $r3
00000048: 10023003 00000280 (lg $c0) bra 0x118
00000050: 10000011 2400c780 ld $r4 b32 c0[0x0]
00000058: b00405fd 600107c8 set $c0 # g f32 $r2 $r4
00000060: 10000601 0403c280 (lg $c0) mov b32 $r0 $r3
00000068: 10000605 0403c280 (lg $c0) mov b32 $r1 $r3
00000070: 10000609 0403c280 (lg $c0) mov b32 $r2 $r3
00000078: 10023003 00000280 (lg $c0) bra 0x118
00000080: c0000211 04367003 mul f32 $r4 $r1 0x43670000
00000088: b0000415 04204003 add f32 $r5 $r2 0x42040000
00000090: c0050811 00000780 mul rn f32 $r4 $r4 $r5
00000098: 102e8015 0486563f mov b32 $r5 0x486563ee
000000a0: e0050011 00010780 add f32 $r4 (mul $r0 $r5) $r4
000000a8: b00d0415 03f8cccf add f32 $r5 $r2 0x3f8ccccd
000000b0: c0050811 00000780 mul rn f32 $r4 $r4 $r5
000000b8: b0000811 c0000780 presin f32 $r4 $r4
000000c0: 90000811 a0000780 cos f32 $r4 $r4
000000c8: 10088015 03f7ae17 mov b32 $r5 0x3f7ae148
000000d0: b00509fd 600107c8 set $c0 # g f32 $r4 $r5
000000d8: 10021003 00000100 (e $c0) bra 0x108
000000e0: 10008005 00000003 mov b32 $r1 0x0
000000e8: 1000800d 03f80003 mov b32 $r3 0x3f800000
000000f0: c00a0409 03ba3d73 mul f32 $r2 $r2 0x3ba3d70a
000000f8: 10000201 0403c780 mov b32 $r0 $r1
00000100: 10023003 00000780 bra 0x118
00000108: b0000409 03f80003 B add f32 $r2 $r2 0x3f800000
00000110: 10004003 00000780 bra 0x20
00000118: b00d0001 03dccccf B add f32 $r0 $r0 0x3dcccccd
00000120: f0000001 e0000001 exit (never) nop

I guess it's time to look at this a little more carefully.

Can one of the people with the issue try running an affected program with NOUVEAU_SHADER_WATCHDOG=false in the environment? and/or changing the 0x18 value written to WATCHDOG_TIMER in nv50_screen.c to 0x1e.

For record (discussed on IRC), this is still rendering improperly with the proposed changes above.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1064.
Created attachment 98301 [details]
apitrace of simple test program.
The problem appears in shaders containing loops with a conditional break when the loop is not unrolled by nouveau. I checked the emitted code with ST_DEBUG=tgsi NV50_PROG_DEBUG=1. Tested with Mesa 10.2.0-devel (git-475f5ff) from 2014-04-30. This bug has existed for more than a year; I first saw it in 2013-02. I think the artifacts appear in areas where there is a big difference in the number of iterations executed per pixel before breaking out of the loop.
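To illustrate the hypothesis about diverging iteration counts, here is a rough CPU-side transcription of the loop from the TGSI dump in this report. It is only a sketch: the function name `shade`, the input ranges, and the value passed for CONST[0].x are assumptions, and Python uses doubles where the GPU uses 32-bit floats, so exact counts will not match the hardware.

```python
import math

def shade(x, y, limit):
    """Evaluate the fragment shader loop for one pixel.

    x, y  : IN[0].xy (assumed to be some interpolated coordinate)
    limit : CONST[0].x (uniform; assumed meaning: iteration limit)
    Returns (rgba, completed_iterations).
    """
    color = [1.0, 1.0, 1.0, 1.0]      # MOV TEMP[0], IMM[0].xxxx
    i = 0.0                           # TEMP[2].x, a float loop counter
    completed = 0
    while True:
        if i >= 200.0:                # FSGE ... IMM[0].zzzz, BRK
            break
        if limit < i:                 # FSLT CONST[0].xxxx, TEMP[2].xxxx, BRK
            break
        t = (y * 234895.7188 + x * 231.0 * (i + 33.0)) * (i + 1.1)
        if 0.98 < math.cos(t):        # FSLT IMM[1].wwww, COS result
            color = [0.0, 0.0, i * 0.005, 1.0]
            break                     # the conditional break inside the body
        i += 1.0
        completed += 1
    color[0] += 0.1                   # ADD TEMP[0].x after ENDLOOP
    return color, completed
```

Something like `[shade(px / 64.0, 0.5, 1000.0)[1] for px in range(64)]` gives the per-pixel iteration counts along one row, which can be compared with where the artifacts show up in the attached screenshots.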