Bug 95414 - [SNA HSW] Corruption in OpenGL video output (DRI2 + TearFree)
Summary: [SNA HSW] Corruption in OpenGL video output (DRI2 + TearFree)
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-15 22:12 UTC by Matti Hämäläinen
Modified: 2016-11-08 09:03 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Always flush the dirty CPU cache on flip (2.41 KB, patch)
2016-11-05 21:54 UTC, Chris Wilson
no flags Details | Splinter Review

Description Matti Hämäläinen 2016-05-15 22:12:01 UTC
After my main monitor with DVI-D input giving up the ghost today, I had to switch to older one that only has VGA input. This revealed some weird corruption issues in OpenGL video output of MPV and also other draw issues. Notably, Xv output with MPV does not seem to exhibit the problem. I haven't experienced any other problems with OpenGL - games such as QuakeForge / DarkPlaces, ET:Legacy, etc. work without visual issues.

As the only thing that has changed is switching from DVI to VGA, I can only assume that this is related. The graphics stack is reasonably up to date, DDX is from Git.

The below link has two video captures taken with a camera, demonstrating the GL video output weirdness. Also included are two full-debug Xorg logs.

http://tnsp.org/~ccr/intel-gfx/gl-ddx-vga/

Unfortunately I can only now test on VGA, until I get a new DVI/HDMI monitor.

-- Window manager: WindowMaker 0.95.6+b1
-- chipset: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
-- system architecture: x86-64 / 64bit
-- xf86-video-intel: GIT head/master 88733a7874f7c9b45da5d612802947a9de12893a
-- xserver: X.Org X Server 1.18.3-1 from latest Debian testing
-- mesa: 11.1.3-1 from Debian testing
-- libpixman: 0.33.6-1
-- libdrm version: 2.4.68-1
-- kernel version: 4.5.4 (vanilla+grsec)
-- Linux distribution: current Debian Testing
-- Machine or mobo model: Asus H97M-PLUS, core i7-4770
-- Display connector: VGA
Comment 1 Chris Wilson 2016-05-16 08:58:49 UTC
First thought is not guilty, y'honour. Looks to be doing an exchange on each swap buffer, returning a fresh backbuffer to the mpv client and then pageflipping the new front on the next TearFree flip.

Hmm.  Maybe related to:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 24c28f4..bf125d9 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3085,8 +3085,7 @@ out:
         * object is now coherent at its new cache level (with respect
         * to the access domain).
         */
-       if (obj->cache_dirty &&
-           obj->base.write_domain != I915_GEM_DOMAIN_CPU &&
+       if (obj->base.write_domain != I915_GEM_DOMAIN_CPU &&
            cpu_write_needs_clflush(obj)) {
                if (i915_gem_clflush_object(obj, true))
                        i915_gem_chipset_flush(to_i915(obj->base.dev));
Comment 2 Matti Hämäläinen 2016-05-16 09:04:32 UTC
Will check when I get home. However, the Geeqie-issue (which I promptly forgot to describe in detail ..) may be a completely separate bug. It presents as semi-large-ish rectangular areas of screen not updating. Also happens on other software too, occasionally, but less reliably.

Should I submit a separate ticket for that?
Comment 3 Matti Hämäläinen 2016-05-16 13:26:22 UTC
Your guess seems to be correct, the weirdness went away with that change in kernel.

The other draw issue remains, so maybe it's a DDX issue. I'll see if I can bisect it.
Comment 4 Matti Hämäläinen 2016-05-16 13:32:29 UTC
The downside of the obj->cache_dirty flag check change seems to be that now things (like moving a window) are noticeably "laggy".
Comment 5 Chris Wilson 2016-05-17 07:00:46 UTC
Hopefully

commit a508b11bde9f3119b49b3e0f652587efb9e037af
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 17 07:55:03 2016 +0100

    sna: Don't skip migration-to-GPU for TearFree
    
    In 46caee86db0f ("sna: Fix reporting of errno after setcrtc failure"),
    the intention was to avoid reporting a fail to migrate whilst wedged for
    a simple copy from the frontbuffer to TearFree's shadow buffer. However,
    by skipping the migration, we never flushed any dirt from the CPU buffer
    prior to doing the TearFree flip.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=95401#c7
    References: https://bugs.freedesktop.org/show_bug.cgi?id=95414#c4
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

fixes the TearFree glitches.
Comment 6 Matti Hämäläinen 2016-05-17 07:23:12 UTC
Confirming that this fixes the "rectangular areas not (re)painted" issue. The "noisy/glitchy" video playback issue is unaffected, obviously .. I really should've reported these as separate bugs, sorry.
Comment 7 Matti Hämäläinen 2016-05-17 07:30:14 UTC
One note about the video playback noise-glitching: happens only in fullscreen, apparently.

I'm hoping that a better solution can be found than the small kernel change, as it causes the nasty "lag" I mentioned. :S
Comment 8 Matti Hämäläinen 2016-05-28 20:01:47 UTC
Disabling TearFree seems to get rid of the corruption as well.
Comment 9 Matti Hämäläinen 2016-09-26 09:19:15 UTC
Just noting this down here: Apparently this issue is the result of DRI2 + TearFree, and does not occur with DRI3 + TearFree.
Comment 10 Chris Wilson 2016-11-05 21:54:22 UTC
Created attachment 127793 [details] [review]
Always flush the dirty CPU cache on flip

I'm becoming more convinced there is a bug the attached fixes... Hopefully this bug!
Comment 11 Matti Hämäläinen 2016-11-05 23:01:54 UTC
Sorry to disappoint .. I applied the patch against 4.7.10 (applied almost cleanly, just the first section needed to be adjusted manually.), but the corruption still appears with DRI2+TearFree + fullscreen video. :S

Does the patch perhaps assume a fresher kernel? If you need something else to be tested or so, just gimme a holler. :)

Perhaps I should note, though it is shown in the video captures, that the issue does not always instantly appear (and I can't seem to make it appear without a fullscreen window), but requires some "twiddling" e.g. seeking forwards and backwards in the video (whatever that causes MPV OpenGL output to do).
Comment 12 Chris Wilson 2016-11-06 08:49:52 UTC
Pretty please try https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=clflush

3 more patches to try and clflush more often.
Comment 13 Matti Hämäläinen 2016-11-06 21:23:47 UTC
Sorry, still occurs with that kernel (it's 4.9.0-rc3-g7bf563a, so ought to be right branch and commit.) :/
Comment 14 Chris Wilson 2016-11-06 21:36:05 UTC
Stick with that kernel and try:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c642385..9adfe4b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3733,6 +3733,7 @@ struct i915_vma *
                obj->base.read_domains = I915_GEM_DOMAIN_CPU;
                obj->base.write_domain = I915_GEM_DOMAIN_CPU;
        }
+       obj->cache_dirty = true;
 
        trace_i915_gem_object_change_domain(obj,
                                            old_read_domains,
Comment 15 Chris Wilson 2016-11-06 21:37:04 UTC
That's i915_gem_object_set_to_cpu_domain() in case line numbers don't match up.
Comment 16 Matti Hämäläinen 2016-11-06 21:47:10 UTC
No such luck, still occurs. :|
Comment 17 Chris Wilson 2016-11-07 13:31:32 UTC
New theory:

--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1282,6 +1282,7 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 
                /* update for the implicit flush after a batch */
                obj->base.write_domain &= ~I915_GEM_GPU_DOMAINS;
+               obj->cache_dirty |= obj->cache_level != I915_CACHE_NONE;
        }
 
        if (flags & EXEC_OBJECT_NEEDS_FENCE)

Can you please try https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=prescheduler
Comment 18 Matti Hämäläinen 2016-11-07 15:27:12 UTC
As mentioned on IRC, that kernel branch/patch fixes the corruption. Also, applying the same change to 4.7.10 fixes it as well.

However, when in fullscreen, and doing ping-pong style seeking, the "desktop frame" .. or whatever to call it, sometimes flickers into view momentarily. This does not occur with DRI3 (+TearFree). Maybe that is a "defect" of DRI2?

Otherwise I think this one is solved. :)
Comment 19 Matti Hämäläinen 2016-11-07 15:41:03 UTC
One more note: the occasional "desktop frame" visible thing only happens with DRI2+TearFree, with DRI2 + TearFree OFF it won't happen, nor with DRI3.
Comment 20 Chris Wilson 2016-11-07 15:48:21 UTC
Define flicker? Whole screen, or the fuzzy set of lines as before?
Comment 21 Matti Hämäläinen 2016-11-07 15:56:44 UTC
Parts of the screen are from the 'desktop' frame with MPV window visible etc, for duration of one frame or so.

It is not continuous flickering, but can be triggered with that ping-pong seeking, possibly combined with switching between fullscreen and windowed video few times, can take some effort tho.

Oh, and the parts that "show through" are those that should be black when 4:3 video is scaled to 16:9 fullscreen, e.g. "side borders".
Comment 22 Matti Hämäläinen 2016-11-07 15:58:53 UTC
Ehm, to be precise, indeed not fuzzy "corrupted" lines, but rectangular areas outside of the scaled video image.
Comment 23 Chris Wilson 2016-11-07 16:06:50 UTC
Sound idiosyncratic of how DRI2 operates in sending new buffers to the client unsynchonrized to the actual rendering stream of the client (i.e. it is possible for the buffer to be change in the middle of a draw cycle by sending an invalidate event after it has done the glClear causing it to refetch the backbuffer...). Basically (mesa) intel_prepare_render() should only update the buffers once per frame. Maybe something else but there is a race in DRI2 that could present with those symptoms.
Comment 24 yann 2016-11-07 16:54:10 UTC
Reference to Chris' patch: https://patchwork.freedesktop.org/series/14930/
Comment 25 Chris Wilson 2016-11-07 21:02:27 UTC
commit 7aa6ca61ee5546d74b76610894924cdb0d4a1af0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 7 16:52:04 2016 +0000

    drm/i915: Mark CPU cache as dirty when used for rendering
    
    On LLC, or even snooped, machines rendering via the GPU ends up in the CPU
    cache. This cacheline dirt also needs to be flushed to main memory when
    moving to an incoherent domain, such as the display's scanout engine.
    Mostly, this happens because either the object is marked as dirty from
    its first use or is avoided by setting the object into the display
    domain from the start.

I have an idea for the DRI2 patch, might be worth opening a fresh bug for that incomplete frame glitch. In fact, lets do that if you want to do some more testing...
Comment 26 yann 2016-11-08 08:10:00 UTC
Matti, can you re-test with that commit and confirm whether or not issue is now fixed on your side?
Comment 27 Matti Hämäläinen 2016-11-08 09:01:24 UTC
Yes, tested the v2 patch from patchwork against vanilla kernel 4.7.10 and confirmed it working.
Comment 28 yann 2016-11-08 09:03:27 UTC
(In reply to Matti Hämäläinen from comment #27)
> Yes, tested the v2 patch from patchwork against vanilla kernel 4.7.10 and
> confirmed it working.

Thanks Matti, then closing as fixed


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.