Bug 67887

Summary: bisected dota 2 regression
Product: Mesa Reporter: Vedran Rodic <vrodic>
Component: glsl-compilerAssignee: Paul Berry <stereotype441>
Status: RESOLVED NOTOURBUG QA Contact:
Severity: normal    
Priority: medium CC: alexandre.f.demers, idr
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: corruption

Description Vedran Rodic 2013-08-08 07:06:02 UTC
I was testing the latest Mesa git and found a regression in dota 2 that I narrowed down using git bisect to this commit:


http://cgit.freedesktop.org/mesa/mesa/commit/?id=6a2baf3a06125405aa648e208af782e53f1312c0

The image of the corruption is attached.

The bug can be seen in this retrace: 

http://mjesec.ffzg.hr/~vrodic/dota/jakiro-bug/dota_linux.2.trace.bz2

Of course, reverting the patch fixes the problem.
Comment 1 Vedran Rodic 2013-08-08 07:14:32 UTC
Created attachment 83810 [details]
corruption
Comment 2 Paul Berry 2013-08-09 00:04:28 UTC
I'm looking into this.
Comment 3 Paul Berry 2013-08-09 05:30:37 UTC
Ok, I've found the problem, but I think I need some more information to figure out what the proper fix is.

It turns out that the trace is relying on some undefined behaviour.  It uses a fragment shader whose only output is gl_FragData[0] to render to a framebuffer containing two color buffer attachments.  It seems to expect that the value written to gl_FragData[0] will be delivered to both color buffer attachments.  However, according to section 15.2.3 (Shader Outputs) of the GL 4.4 spec, that's not guaranteed to happen:

"Writing to gl_FragColor specifies the fragment color (color number zero)
that will be used by subsequent stages of the pipeline. Writing to gl_-
FragData[n] specifies the value of fragment color number n. Any colors, or
color components, associated with a fragment that are not written by the fragment shader are undefined."

It just so happens that before commit 6a2baf3a06125405aa648e208af782e53f1312c0, the undefined behaviour was, by coincidence, to replicate gl_FragData[0] in all color buffer attachments, so dota 2 worked in spite of relying on undefined behaviour.

The exact mechanics of the coincidence were as follows.  Prior to that commit, the linker would resize shader inputs and outputs based on the maximum array size used, so in the case of this shader it would resize gl_FragData to a vec4[1].  That in turn would cause the i965 back end to leave fs_visitor::outputs[1] unpopulated.  So when fs_visitor::emit_color_write() was called upon to copy a component of gl_FragData[1] to a message register (or pseudo-message register in Gen7), it would skip it, and the old data in the message register (which was leftover from gl_FragData[0]) would get re-used.  The reason this is such a coincidence is that it relies on the fact that all framebuffer writes are sent from the same set of message registers, due to a limitation in the i965 back end that we hope to fix one day.

After commit 6a2baf3a06125405aa648e208af782e53f1312c0, the linker no longer resizes gl_FragData, so it remains a vec4[8].  As a result, fs_visitor::outputs[1] gets populated with a virtual GRF, which never gets assigned to.  So when fs_visitor::emit_color_write() is called upon to copy a component of gl_FragData[1] to a message register, it copies uninitialized data from that virtual GRF.  As a result, garbage data gets written to the second color attachment.

Unfortunately, reverting commit 6a2baf3a06125405aa648e208af782e53f1312c0 isn't a practical fix--not only do geometry shaders depend on this commit, but the old code only produced correct results by coincidence, and it's easy to imagine us making some future Mesa improvement (such as sending different framebuffer writes from different registers) that would break things again.  It would be far better, if possible, to get the app to stop relying on undefined behaviour.

If we can't get traction on that, then as a last resort we can always add a workaround to Mesa that detects when multiple color attachments are used with a shader that only writes to gl_FragData[0], and explicitly replicates gl_FragData[0] to all color attachments.

Vedran, I am guessing that this is a DirectX app being run under WINE--is that correct?  If so, I'd like to try contacting the WINE folks next, to see if it's a bug in WINE that's causing the app to make use of undefined behaviour.
Comment 4 Vedran Rodic 2013-08-09 07:02:20 UTC
No, this is the Dota 2 native client trace. Sorry for that omission.
Comment 5 Vedran Rodic 2013-08-09 07:21:34 UTC
Wine version in contrast isn't hit by this. Thanks Paul for the detailed explanation. I guess it's best to talk to guys at Valve. 

It's possible that this undefined behavior is also present in AMD fglrx and nVidia proprietary drivers. I'll try to test that today.
Comment 6 Vedran Rodic 2013-08-09 08:16:24 UTC
I can't reproduce this on AMD fglrx 13.8 beta 1 with the same trace file (Radeon 6850), so I guess this undefined behavior is something that is assumed by Valve. 

I suspect that nVidia proprietary driver is the same, but I didn't check. I have to replace the card in my machine for that.
Comment 7 Paul Berry 2013-08-09 15:34:11 UTC
I tested your apitrace file on my NVIDIA system and did not see the bug, so it seems your suspicions are correct.

Given that AMD and NVIDIA appear to replicate gl_FragData[0] to all otherwise-undefined outputs, and Mesa's i965 back-end used to do so, it seems reasonable to me to restore the old behaviour, even though the spec doesn't require it.

I'll send out patches to Mesa-dev to do this later today.
Comment 9 Vedran Rodic 2013-08-09 22:39:03 UTC
If this patches are applied to current mesa git (ver git-5894898), the problem is still there.

If however if I use your git wip branch from here: https://github.com/stereotype441/mesa.git (ver git-355158d) the problem is fixed.
Comment 10 Paul Berry 2013-08-10 05:48:18 UTC
(In reply to comment #9)
> If this patches are applied to current mesa git (ver git-5894898), the
> problem is still there.
> 
> If however if I use your git wip branch from here:
> https://github.com/stereotype441/mesa.git (ver git-355158d) the problem is
> fixed.

Hmm, weird.  I tried applying the patches to git 5894898, and they fixed the problem for me.  Can you please have a look at branch "bugfix-67887" of https://github.com/stereotype441/mesa.git and see if that fixes the problem for you?  That's the tree I get when I apply the patches to git 5894898.
Comment 11 Vedran Rodic 2013-08-10 06:53:19 UTC
I guess i don't know how to patch. It seems that the second patch line from the first patch was missed when I was making a patch file from the mailing list.

I confirm the bug fixed with your patches.
Comment 12 Paul Berry 2013-08-10 15:03:00 UTC
(In reply to comment #11)
> I guess i don't know how to patch. It seems that the second patch line from
> the first patch was missed when I was making a patch file from the mailing
> list.
> 
> I confirm the bug fixed with your patches.

No problem.  Thanks for following up and checking my work.
Comment 13 Emil Velikov 2013-08-10 19:37:49 UTC
Hope you don't mind that I chip in

Exact same corruption is present with the nouveau(gallium/nv50) driver, and reverting the commit does restore rendering back to normal.

Unfortunately neither the patches on the mailing list, nor the wip/bugfix-67887 branches seems to address the issue in here. I'm suspecting that we need a patch similar to the intel specific one - [PATCH 2/2] i965/fs: Re-use output 0 for outputs without	a static write.

Paul would you have any tips on how to resolve the issue ? I'm guessing that other gallium users may be affected as well.

If you're interested in testing nouveau you'll need to apply this workaround [1] (from bug 64323)

Cheers
Emil

[1] https://bugs.freedesktop.org/attachment.cgi?id=83839
Comment 14 Ian Romanick 2013-08-12 21:11:47 UTC
I'm a bit conflicted about this.  Section 7.2 (Fragment Shader Special Variables) of the GLSL 1.20 spec says:

    "If subsequent fixed functionality consumes fragment data
    and an execution of a fragment shader executable does not
    write a value to it, then the fragment data consumed is
    undefined."

It's just luck that the i965 driver used to end up with gl_FragData[0] in the other slots as well.  I don't know if it's luck or intentional choice that the closed source drivers also end up with gl_FragData[0] in the other slots.  In either case, Dota2 is depending on undefined behavior.  The spec already requires that writes to gl_FragColor be written to all outputs, so they could just s/gl_FragData[0]/gl_FragColor/g in that shader.

If we decide to bring back the old behavior, I believe we should do this as a post-link "fix up" pass:

     If the shader contains only static accesses to gl_FragColor[0],
     replace those accesses with gl_FragData.

That way we shouldn't need any driver-specific code.

If we decide to bring back the old behavior, we should also work with Khronos to make the spec require it.

Let's see what Valve says before we proceed.
Comment 15 Paul Berry 2013-08-14 15:47:23 UTC
I just got in touch with one of the folks at Valve, and it sounds like they are planning to fix it on their end.

I'll hold the bug open until the new version of Dota2 ships, but I won't take any further action on it unless plans change.
Comment 16 Vedran Rodic 2013-08-22 21:03:23 UTC
This has been fixed in the latest Dota 2 test client (released 2013-08-22). 

Performance problems detailed in https://bugs.freedesktop.org/show_bug.cgi?id=68214 do persist still. Hopefully, you'll be able to work with Valve on those to at some point.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.