Created attachment 133599 [details] Screenshot (scaled to 1/4 size) After mid-July, SynMark v7 CSDof has started to fail render validation. This happens on BXT, GLK, SKL GT2 and KBL GT2: ------------------------------ $ ./synmark2 OglCSDof ... Validation: Failed - different (1.062 %), different (0.940 %) ------------------------------ It doesn't seem to happen with earlier generations or eDRAM machines. On every machine & commit the misrendering looks identical to the attached screenshot. Issue is Mesa specific. There are no errors from kernel. Bisecting shows following commit to be he culprit: ------------------------------ commit dd75edb42982c1420168d509a7589032f7ead289 Author: Jason Ekstrand <jason.ekstrand@intel.com> AuthorDate: Wed Jun 21 21:23:20 2017 -0700 Commit: Jason Ekstrand <jason.ekstrand@intel.com> CommitDate: Sat Jul 22 20:59:22 2017 -0700 i965/surface_state: Get the aux usage from the miptree code Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> ------------------------------ When running the tests without validation, some of the frames look fine and some have the misrender on top of it. SynMark validation doesn't always catch the misrendering, I guess in that case the first frame (used for validation) doesn't happen to have the error. I'm filing this separately in case it requires fix separate from bug 102258 perf drop.
Correction, this happens on all GEN9+ machine except SKL GT4e (SkullCanyon). GT3e machines never failed SynMark validation, but the issue is still there too, just in fewer frames.
Patch on the list: https://patchwork.freedesktop.org/patch/172549/ Good catch! That was a nasty one that's been there for a couple of years. The only reason we didn't catch it before was that we had an optimization which was papering over the real issue. When I accidentally removed that optimization in dd75edb42982c1, the paper came off and the bug appeared.
This is fixed by the following commit in master: commit d5e217dbfda2a87e149bdc8586c25143fc0ac183 Author: Jason Ekstrand <jason.ekstrand@intel.com> Date: Fri Aug 18 16:10:39 2017 -0700 i965: Stop looking at NewDriverState when emitting 3DSTATE_URB Looking at NewDriverState is not safe in general. The state atom system is set up to ensure that new bits that get added to NewDriverState get accumulated into the set of bits used when emitting atoms but it doesn't go the other way. If we read NewDriverState, we may not get the full picture because the per-pipeline state (3D or compute) does not get added to NewDriverState before state emit is done. It's especially dangerous to do this from BLORP (either explicitly or implicitly when BLORP calls gen7_upload_urb) because that does not happen during one of the normal state upload paths. This commit solves the problem by whacking all of the per-shader-stage URB sizes to zero whenever we change the total URB size. We still have to flag BRW_NEW_URB_SIZE to ensure that the gen7_urb atom triggers but the actual decision in gen7_upload_urb can now be based entirely on URB sizes rather than on state atoms. This also makes BLORP correct because it just asks for a new URB config whenever the vsize is too small and so any change to the total URB size will trigger blorp to re-emit as well because 0 < vs_entry_size. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Bugzilla: https://bugs.freedesktop.org/102289 Cc: mesa-stable@lists.freedesktop.org
Note that this bug could have caused rendering corruption or possibly hangs in programs which use compute shaders. It's an important fix that may end up solving a lot of other issues as well.
Verified, CSDof renders now fine without rendering errors. (As to Vulkan, Sacha Willem's compute shader tests have still some reliability problems on BSW & BYT.)
I'll give it a go. Meanwhile I'll note here some of my findings. First hang I got was actually in blit ring (coming from readpixels()). I simply made the actual blit a no-op to see if I get similar hangs I saw with HSW. What I got looks very different, seems that batchbuffer contains garbage (0x423303d0 doesn't look like valid op-code at all): batchbuffer (render ring (submitted by testfw_app [3096])) at 0x1aa4b000 0x1aa4b000: 0x423303d0: 2D UNKNOWN 0x1aa4b004: 0xc2a90ba2: UNKNOWN 0x1aa4b008: 0x43164f2a: 2D UNKNOWN 0x1aa4b00c: 0x43168086: 2D UNKNOWN 0x1aa4b010: 0x423303d0: 2D UNKNOWN 0x1aa4b014: 0xc2a90ba2: UNKNOWN 0x1aa4b018: 0x43164f2a: 2D UNKNOWN 0x1aa4b01c: 0x43168086: 2D UNKNOWN 0x1aa4b020: 0x42393ff0: 2D UNKNOWN 0x1aa4b024: 0xc2b2caae: UNKNOWN 0x1aa4b028: 0x43199be2: 2D UNKNOWN 0x1aa4b02c: 0x4319cd34: 2D UNKNOWN 0x1aa4b030: 0x42393ff0: 2D UNKNOWN 0x1aa4b034: 0xc2b2caae: UNKNOWN 0x1aa4b038: 0x43199be2: 2D UNKNOWN 0x1aa4b03c: 0x4319cd34: 2D UNKNOWN 0x1aa4b040: 0x422f8890: 2D UNKNOWN 0x1aa4b044: 0xc2a6efdd: UNKNOWN 0x1aa4b048: 0x43189bbb: 2D UNKNOWN 0x1aa4b04c: 0x4318cd10: 2D UNKNOWN 0x1aa4b050: 0x422f8890: 2D UNKNOWN 0x1aa4b054: 0xc2a6efdd: UNKNOWN 0x1aa4b058: 0x43189bbb: 2D UNKNOWN 0x1aa4b05c: 0x4318cd10: 2D UNKNOWN 0x1aa4b060: 0xc1208f4b: UNKNOWN 0x1aa4b064: 0xc132a236: UNKNOWN 0x1aa4b068: 0x431b235e: 2D UNKNOWN 0x1aa4b06c: 0x431b54aa: 2D UNKNOWN 0x1aa4b070: 0x00000000: MI_NOOP 0x1aa4b074: 0x7a000003: PIPE_CONTROL
Sorry, wrong bug, wrote into wrong tab :(
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.