Bug 102289

Summary: SynMark CSDof misrenders with "i965/surface_state: Get the aux usage from the miptree code"
Product: Mesa Reporter: Eero Tamminen <eero.t.tamminen>
Component: Drivers/DRI/i965Assignee: Jason Ekstrand <jason>
Status: VERIFIED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=102258
Whiteboard:
i915 platform: i915 features:
Attachments: Screenshot (scaled to 1/4 size)

Description Eero Tamminen 2017-08-18 09:46:37 UTC
Created attachment 133599 [details]
Screenshot (scaled to 1/4 size)

After mid-July, SynMark v7 CSDof has started to fail render validation.  This happens on BXT, GLK, SKL GT2 and KBL GT2:
------------------------------
$ ./synmark2 OglCSDof
...
Validation: Failed - different (1.062 %), different (0.940 %)
------------------------------

It doesn't seem to happen with earlier generations or eDRAM machines.


On every machine & commit the misrendering looks identical to the attached screenshot.  Issue is Mesa specific.  There are no errors from kernel.

Bisecting shows following commit to be he culprit:
------------------------------
commit dd75edb42982c1420168d509a7589032f7ead289
Author:     Jason Ekstrand <jason.ekstrand@intel.com>
AuthorDate: Wed Jun 21 21:23:20 2017 -0700
Commit:     Jason Ekstrand <jason.ekstrand@intel.com>
CommitDate: Sat Jul 22 20:59:22 2017 -0700

    i965/surface_state: Get the aux usage from the miptree code
    
    Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
------------------------------

When running the tests without validation, some of the frames look fine and some have the misrender on top of it.  SynMark validation doesn't always catch the misrendering, I guess in that case the first frame (used for validation) doesn't happen to have the error.

I'm filing this separately in case it requires fix separate from bug 102258 perf drop.
Comment 1 Eero Tamminen 2017-08-18 12:19:43 UTC
Correction, this happens on all GEN9+ machine except SKL GT4e (SkullCanyon).  

GT3e machines never failed SynMark validation, but the issue is still there too, just in fewer frames.
Comment 2 Jason Ekstrand 2017-08-18 23:35:18 UTC
Patch on the list:

https://patchwork.freedesktop.org/patch/172549/

Good catch!  That was a nasty one that's been there for a couple of years.  The only reason we didn't catch it before was that we had an optimization which was papering over the real issue.  When I accidentally removed that optimization in dd75edb42982c1, the paper came off and the bug appeared.
Comment 3 Jason Ekstrand 2017-08-19 00:34:24 UTC
This is fixed by the following commit in master:

commit d5e217dbfda2a87e149bdc8586c25143fc0ac183
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Aug 18 16:10:39 2017 -0700

    i965: Stop looking at NewDriverState when emitting 3DSTATE_URB
    
    Looking at NewDriverState is not safe in general.  The state atom system
    is set up to ensure that new bits that get added to NewDriverState get
    accumulated into the set of bits used when emitting atoms but it doesn't
    go the other way.  If we read NewDriverState, we may not get the full
    picture because the per-pipeline state (3D or compute) does not get
    added to NewDriverState before state emit is done.  It's especially
    dangerous to do this from BLORP (either explicitly or implicitly when
    BLORP calls gen7_upload_urb) because that does not happen during one of
    the normal state upload paths.
    
    This commit solves the problem by whacking all of the per-shader-stage
    URB sizes to zero whenever we change the total URB size.  We still have
    to flag BRW_NEW_URB_SIZE to ensure that the gen7_urb atom triggers but
    the actual decision in gen7_upload_urb can now be based entirely on URB
    sizes rather than on state atoms.  This also makes BLORP correct because
    it just asks for a new URB config whenever the vsize is too small and so
    any change to the total URB size will trigger blorp to re-emit as well
    because 0 < vs_entry_size.
    
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
    Bugzilla: https://bugs.freedesktop.org/102289
    Cc: mesa-stable@lists.freedesktop.org
Comment 4 Kenneth Graunke 2017-08-19 04:03:01 UTC
Note that this bug could have caused rendering corruption or possibly hangs in programs which use compute shaders.  It's an important fix that may end up solving a lot of other issues as well.
Comment 5 Eero Tamminen 2017-08-21 10:31:50 UTC
Verified, CSDof renders now fine without rendering errors.

(As to Vulkan, Sacha Willem's compute shader tests have still some reliability problems on BSW & BYT.)
Comment 6 Topi Pohjolainen 2017-08-21 12:59:41 UTC
I'll give it a go.

Meanwhile I'll note here some of my findings. First hang I got was actually in blit ring (coming from readpixels()). I simply made the actual blit a no-op to see if I get similar hangs I saw with HSW.

What I got looks very different, seems that batchbuffer contains garbage
(0x423303d0 doesn't look like valid op-code at all):

batchbuffer (render ring (submitted by testfw_app [3096])) at 0x1aa4b000
0x1aa4b000:      0x423303d0: 2D UNKNOWN
0x1aa4b004:      0xc2a90ba2: UNKNOWN
0x1aa4b008:      0x43164f2a: 2D UNKNOWN
0x1aa4b00c:      0x43168086: 2D UNKNOWN
0x1aa4b010:      0x423303d0: 2D UNKNOWN
0x1aa4b014:      0xc2a90ba2: UNKNOWN
0x1aa4b018:      0x43164f2a: 2D UNKNOWN
0x1aa4b01c:      0x43168086: 2D UNKNOWN
0x1aa4b020:      0x42393ff0: 2D UNKNOWN
0x1aa4b024:      0xc2b2caae: UNKNOWN
0x1aa4b028:      0x43199be2: 2D UNKNOWN
0x1aa4b02c:      0x4319cd34: 2D UNKNOWN
0x1aa4b030:      0x42393ff0: 2D UNKNOWN
0x1aa4b034:      0xc2b2caae: UNKNOWN
0x1aa4b038:      0x43199be2: 2D UNKNOWN
0x1aa4b03c:      0x4319cd34: 2D UNKNOWN
0x1aa4b040:      0x422f8890: 2D UNKNOWN
0x1aa4b044:      0xc2a6efdd: UNKNOWN
0x1aa4b048:      0x43189bbb: 2D UNKNOWN
0x1aa4b04c:      0x4318cd10: 2D UNKNOWN
0x1aa4b050:      0x422f8890: 2D UNKNOWN
0x1aa4b054:      0xc2a6efdd: UNKNOWN
0x1aa4b058:      0x43189bbb: 2D UNKNOWN
0x1aa4b05c:      0x4318cd10: 2D UNKNOWN
0x1aa4b060:      0xc1208f4b: UNKNOWN
0x1aa4b064:      0xc132a236: UNKNOWN
0x1aa4b068:      0x431b235e: 2D UNKNOWN
0x1aa4b06c:      0x431b54aa: 2D UNKNOWN
0x1aa4b070:      0x00000000: MI_NOOP
0x1aa4b074:      0x7a000003: PIPE_CONTROL
Comment 7 Topi Pohjolainen 2017-08-21 13:00:49 UTC
Sorry, wrong bug, wrote into wrong tab :(

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.