Bug 96624

Summary: [SKL, KBL] texture misrender in The Talos Principle
Product: Mesa Reporter: Grazvydas Ignotas <notasas>
Component: Drivers/DRI/i965 Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED MOVED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: apinheiro, currojerez, jljusten, sergii.romantsov
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: screenshot
Workaround for trace Talos_flicker3_trim2.trace.xz

Description Grazvydas Ignotas 2016-06-21 22:10:28 UTC
Created attachment 124648 [details]
screenshot

The Talos Principle bug series continues from bug 96607 with another texture misrendering; see the bottom-right corner of the attached screenshot. The texture flickers between good and bad rendering while in-game.

This seems to be a Skylake-specific issue, as it doesn't reproduce on Haswell.
It does not seem to be a regression: it has shown up ever since SKL support was added to Mesa. Same result on 4.4 and 4.6 kernels.

Screenshot is from the last frame of this trace:
https://drive.google.com/file/d/0Bz8fw_SGGDzsUi1keXg5VVdDRUU/view?usp=sharing
Comment 1 Grazvydas Ignotas 2016-06-25 21:11:22 UTC
Somewhat trimmed trace:
https://drive.google.com/file/d/0Bz8fw_SGGDzsbk81T0hKMFo1V2s/view?usp=sharing
bad draw call: 254494

It looks like some sort of caching/batching issue or something like that. While I was trimming the trace, the problem would randomly go away and return when removing draw calls in frames before the 'bad' draw call. The problem also goes away if draw call 254532 (which is after the 'bad' one) is removed. When inspecting the output with qapitrace, it occasionally also renders correctly, although most of the time it's wrong. Replaying the above trace with glretrace always triggers the issue though.
Comment 2 Grazvydas Ignotas 2016-06-27 01:06:09 UTC
I've noticed my trimmed trace triggers a debug assert, so here is another working one:
https://drive.google.com/file/d/0Bz8fw_SGGDzsTVlZZERZRTZiYkU/view?usp=sharing
bad draw call: 266049

It seems MESA_DEBUG=flush or setting always_flush_batch helps somewhat, but it still misrenders on around half of the tries.
Sticking glTextureBarrier after the bad call helps even more, but still doesn't completely solve it.
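
For anyone who wants to try the same flush settings on the trace, here is a minimal sketch of a replay helper. It is not part of the original report. The helper names are hypothetical, and it assumes that the always_flush_batch driconf option can be overridden through an environment variable of the same name and that glretrace is on PATH.

```python
import os
import subprocess


def flush_env(base=None, use_mesa_debug=True, use_driconf=False):
    """Return a copy of the environment with the flush settings applied."""
    env = dict(os.environ if base is None else base)
    if use_mesa_debug:
        # Append "flush" to any existing MESA_DEBUG value instead of replacing it.
        flags = [f for f in env.get("MESA_DEBUG", "").split(",") if f]
        if "flush" not in flags:
            flags.append("flush")
        env["MESA_DEBUG"] = ",".join(flags)
    if use_driconf:
        # Assumption: the driconf option is picked up from an environment
        # variable with the same name as the option.
        env["always_flush_batch"] = "true"
    return env


def replay(trace_path, env):
    """Replay the trace once with glretrace in benchmark mode (-b)."""
    return subprocess.run(["glretrace", "-b", trace_path], env=env).returncode
```

Since the settings only reduce the failure frequency, a single clean replay says little; the replay would need to be repeated many times to compare the variants.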
Comment 3 Alejandro Piñeiro (freenode IRC: apinheiro) 2016-07-01 17:17:40 UTC
(In reply to Grazvydas Ignotas from comment #2)
> I've noticed my trimmed trace triggers a debug assert, so here is another
> working one:
> https://drive.google.com/file/d/0Bz8fw_SGGDzsTVlZZERZRTZiYkU/view?usp=sharing
> bad draw call: 266049
> 
> It seems MESA_DEBUG=flush or setting always_flush_batch helps somewhat, but
> it still misrenders on around half of the tries.
> Sticking glTextureBarrier after the bad call helps even more, but still
> doesn't completely solve it.

glTextureBarrier got a fix just today on mesa master. 

Additionally, some patches related to caching/flushing were sent (still pending review):
https://patchwork.freedesktop.org/patch/96042/
https://patchwork.freedesktop.org/patch/96041/
https://patchwork.freedesktop.org/patch/96044/
https://patchwork.freedesktop.org/patch/96043/

Just in case you want to test if they help.
Comment 4 Grazvydas Ignotas 2016-07-01 22:06:38 UTC
Just retested - unfortunately they don't help, not even a bit. Note that the game doesn't use glTextureBarrier, I was just hacking it in, which doesn't fix anything, just reduces the frequency of the problem.
Comment 5 Eero Tamminen 2017-02-17 15:15:23 UTC
Could you try again, e.g. Mesa 17.x?  I don't see this bug on SKL and there have been a lot of fixes both to Talos and Mesa Vulkan support since you filed the bug.
Comment 6 Grazvydas Ignotas 2017-02-18 00:32:40 UTC
(In reply to Eero Tamminen from comment #5)
> I don't see this bug on SKL
How many times have you tried? It's random and doesn't reproduce reliably. The trimmed trace from Comment 2 reproduces it (much) more often (I use glretrace -b -w).

And yes it's still reproducing on mesa-git and current version of the game. Curiously anv doesn't have this issue, but it suffers from general random object flicker and poor performance (around half of OpenGL).
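
Because the failure is random, a reproduction rate is more useful than a single try. The following is a rough sketch, not something used in this report: it replays the trace several times, snapshots one call with glretrace's -s (snapshot prefix) and -S (call set) options, and counts how many distinct images come out. The function names are hypothetical, and it assumes that identical renderings produce byte-identical PNG files.

```python
import collections
import glob
import hashlib
import os
import subprocess
import tempfile


def snapshot_hash(trace_path, call_no, extra_args=()):
    """Replay once, snapshot one call, and return a hash of the PNG bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "snap")
        cmd = ["glretrace", "-s", prefix, "-S", str(call_no),
               *extra_args, trace_path]
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        # Do not rely on the exact snapshot file name; take whatever was written.
        files = sorted(glob.glob(prefix + "*.png"))
        if not files:
            raise RuntimeError("glretrace wrote no snapshot")
        with open(files[0], "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()


def summarize(hashes):
    """Count how often each distinct image appeared, most common first."""
    return collections.Counter(hashes).most_common()


def measure(trace_path, call_no, runs=20):
    """Replay `runs` times and summarize the distinct snapshots seen."""
    return summarize(snapshot_hash(trace_path, call_no) for _ in range(runs))
```

If more than one hash shows up, the rendering is not deterministic. Snapshotting may change timing and therefore the reproduction rate, so the numbers are only comparable between runs made the same way.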
Comment 7 Matt Turner 2017-03-23 18:17:40 UTC
I can reproduce using the trace on SKL/Mesa-17.0.2. It looks exactly like the screenshot.
Comment 8 Eero Tamminen 2017-09-11 08:20:16 UTC
(In reply to Matt Turner from comment #7)
> I can reproduce using the trace on SKL/Mesa-17.0.2. It looks exactly like
> the screenshot.

Matt, does the issue still happen or is it already fixed?
Comment 9 Grazvydas Ignotas 2017-09-11 19:25:47 UTC
Still there for me on today's git.
Comment 10 Sergii Romantsov 2018-04-02 09:49:43 UTC
Reproduced on KBL with provided apitraces:
https://drive.google.com/file/d/0Bz8fw_SGGDzsbk81T0hKMFo1V2s/view?usp=sharing
https://drive.google.com/file/d/0Bz8fw_SGGDzsTVlZZERZRTZiYkU/view?usp=sharing

System:    Kernel: 4.13.0-37-generic x86_64 (64 bit gcc: 5.4.0)
           Desktop: Unity 7.4.0 (Gtk 3.18.9-1ubuntu3.3) Distro: Ubuntu 16.04 xenial
CPU:       Dual core Intel Core i7-7500U (-HT-MCP-) cache: 4096 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 11616
           clock speeds: max: 3500 MHz 1: 2900 MHz 2: 2900 MHz 3: 2900 MHz 4: 2900 MHz
Graphics:  Card-1: Intel Device 5916 bus-ID: 00:02.0
           GLX Renderer: Mesa DRI Intel HD Graphics 620 (Kaby Lake GT2)
           GLX Version: 3.0 Mesa 18.1.0-devel (git-c88e7fe29e) Direct Rendering: Yes
Comment 11 Sergii Romantsov 2018-05-30 08:08:16 UTC
Created attachment 139842 [details] [review]
Workaround for trace Talos_flicker3_trim2.trace.xz

Investigation was done with trace https://drive.google.com/file/d/0Bz8fw_SGGDzsbk81T0hKMFo1V2s/view?usp=sharing
Workarounds (standalone):
1. Only for that trace (does not help in all cases, so the problem will still happen sometimes): 96624_wa.diff
2. Use LIBGL_ALWAYS_SOFTWARE=true (swrast)

More investigations:
1. Used simplified shaders for program 1496 (vertex shader 1286, fragment shader 1296):
1.1. VS_f5145e822d6ab91eba69ffa69464742d03d0545a.glsl:
  - Kept the computation of vVexPosAbs
  - Assigned vLightUV.xy to vMaskLightUV.xy
  - Removed all other computations
1.2. FS_a6439012ab28d0bb9cc17c615c9b0487158f9cb1.glsl:
  - Used vMaskLightUV.xy to compute vOutColor.rgb
  - Removed all other computations
2. Used array buffers:
2.1. 254490 @0 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1139)
2.2. 254517 @0 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 1140)

3. Final observation:
3.1. In the vertex shader, the data in the vLightUV array is invalid.
3.2. Reading the data back via DRM_IOCTL_I915_GEM_PREA (for buffers 1139 and 1140) at different stages shows that the data in the arrays does not change.

4. Tentative conclusion:
4.1. Either some invalid offsets/addresses are computed,
4.2. or some flushing/caching command is missing before the glDraw* calls.

5. Questions:
5.1. Is it possible to debug which address the vertex shader reads vLightUV from?
5.2. Any more ideas?
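
The buffer readback check described above can be sketched as a byte-wise comparison of dumps taken at different stages. This is only an illustration, not the code used in the investigation; the stage names and the way the dumps are obtained are hypothetical.

```python
def first_difference(a, b):
    """Return the first byte offset where two dumps differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One dump is a prefix of the other; they differ where the shorter ends.
        return min(len(a), len(b))
    return None


def changed_stages(dumps):
    """Compare every dump against the first one.

    `dumps` is a list of (stage_name, bytes) pairs in capture order.
    Returns {stage_name: first_differing_offset} for stages that differ.
    """
    if not dumps:
        return {}
    _, reference = dumps[0]
    result = {}
    for name, data in dumps[1:]:
        offset = first_difference(reference, data)
        if offset is not None:
            result[name] = offset
    return result
```

An empty result matches the observation that the buffer contents stay the same across stages, which points away from the data itself and towards how or when it is read.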
Comment 12 Sergii Romantsov 2018-06-13 14:00:50 UTC
Bisected to a363bb2cd0e2a141f2c60be005009703bffcbe4e (i965: Allocate VMA in userspace for full-PPGTT systems.)

That commit significantly improves the behaviour, but the trace still fails about once in 10 runs.
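
With a failure rate this low, a handful of clean replays is not enough to call a candidate fix good. As a back-of-the-envelope sketch, assuming replays are independent and the failure rate really is about 1 in 10 (both assumptions), the number of consecutive clean runs needed is the smallest n with (1 - p)^n <= 1 - confidence:

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of consecutive clean runs n such that an unfixed bug
    with the given per-run failure rate would have shown up at least once
    with probability >= confidence, i.e. (1 - p)**n <= 1 - confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

For a failure rate of 0.1 this gives 29 runs at 95% confidence, so bisecting or testing patches on this trace needs on the order of 30 replays per step.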
Comment 13 GitLab Migration User 2019-09-25 18:56:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1524.
