Bug 78992 - [regression] DRI2 races
Summary: [regression] DRI2 races
Status: RESOLVED WORKSFORME
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: git
Hardware: x86-64 (AMD64) Linux (All)
: medium major
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-20 22:29 UTC by Armin K
Modified: 2014-05-30 11:18 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Current Xorg.0.log, with xf86-video-intel-2.99.911 (53.10 KB, text/plain)
2014-05-21 09:42 UTC, Armin K
no flags Details
KDE Desktop Effects "Advanced Tab" Screenshot (29.52 KB, image/png)
2014-05-21 09:43 UTC, Armin K
no flags Details
Screenshot of the corrupted area (122.17 KB, image/jpeg)
2014-05-21 09:54 UTC, Armin K
no flags Details
Screenshot of the "ghost" konsole (362.80 KB, image/jpeg)
2014-05-22 09:28 UTC, Armin K
no flags Details
Screenshot of the corrupted KDE "start menu" (374.92 KB, image/jpeg)
2014-05-22 09:30 UTC, Armin K
no flags Details
Another Screenshot of the corrupted KDE "start menu" (366.42 KB, image/jpeg)
2014-05-22 09:31 UTC, Armin K
no flags Details
Xorg log with the patch from Comment 23 (76.13 KB, application/x-xz)
2014-05-22 09:32 UTC, Armin K
no flags Details
Xorg log with the patch from Comment 23 and Comment 28 (75.62 KB, application/x-xz)
2014-05-22 11:26 UTC, Armin K
no flags Details

Description Armin K 2014-05-20 22:29:37 UTC
Latest git master of intel ddx causes corrupted rendering on my sandybridge laptop. I have bisected the driver and narrowed it down to the commit 7d516589ba9d0325e57e08d41becff64f81e. Unfortunately, no screenshot tool can capture anything useful since triggering a screenshot program makes it go away from other programs.
Comment 1 Chris Wilson 2014-05-21 05:50:57 UTC
Are you using a compositor? Which?
Comment 2 Chris Wilson 2014-05-21 08:02:42 UTC
Using multiple heads?
Comment 3 Chris Wilson 2014-05-21 08:28:27 UTC
commit 0b3ea29727075bf2b0d86bbeca7dd78201936913
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 21 08:26:21 2014 +0100

    sna/dri2: Tidy recording of the last swap sequence number and time

fixes a bug recording the wrong frame number on the wrong pipe (multihead), which could conceivably upset mesa.
Comment 4 Armin K 2014-05-21 09:29:10 UTC
(In reply to comment #1)
> Are you using a compositor? Which?

KWin 4.13.1/KDE SC 4.13.1
Comment 5 Armin K 2014-05-21 09:30:04 UTC
(In reply to comment #2)
> Using multiple heads?

I have 2 GPU's. Other is Radeon, with no output (muxless hybrid graphics). They are both registered in XServer.
Comment 6 Chris Wilson 2014-05-21 09:36:07 UTC
Using PRIME at all? Please attach your Xorg.0.log in any case. Which Desktop Effects do you have enabled (OpenGL 3.1, Native Render, fullscreen repaints, etc)?
Comment 7 Armin K 2014-05-21 09:42:39 UTC
Created attachment 99489 [details]
Current Xorg.0.log, with xf86-video-intel-2.99.911
Comment 8 Armin K 2014-05-21 09:43:24 UTC
Created attachment 99490 [details]
KDE Desktop Effects "Advanced Tab" Screenshot
Comment 9 Armin K 2014-05-21 09:44:23 UTC
(In reply to comment #6)
> Using PRIME at all? Please attach your Xorg.0.log in any case. Which Desktop
> Effects do you have enabled (OpenGL 3.1, Native Render, fullscreen repaints,
> etc)?

All desktop effects are enabled. I am using PRIME for VDPAU in apps that don't support VAAPI. See screenshot for KDE Desktop Effects options.
Comment 10 Chris Wilson 2014-05-21 09:52:19 UTC
I have my SNB laptop set up similarly (except for the dGPU, so I hope that is irrelevant - it should be). Does the corruption occur reliably with any particular sequence of operations?
Comment 11 Armin K 2014-05-21 09:54:46 UTC
Created attachment 99493 [details]
Screenshot of the corrupted area

I have managed to capture this with my mobile phone. The area between two tabs represents nothing currently, but there's a button. On the other hand, when I click the last button, nothing happens since it isn't actually there, but in the empty area between two tabs - not synced correctly. Note that latest git master with commit mentioned earlier in this bug do not fix the problem.
Comment 12 Armin K 2014-05-21 09:57:41 UTC
(In reply to comment #10)
> I have my SNB laptop set up similarly (except for the dGPU, so I hope that
> is irrelevant - it should be). Does the corruption occur reliably with any
> particular sequence of operations?

Scrolling corruptions, firefox issues, video frames not syncing properly (they are in full screen though), display artifacts sometimes when moving windows, etc.
Comment 13 Chris Wilson 2014-05-21 10:06:36 UTC
Which version of mesa?
Comment 14 Armin K 2014-05-21 10:07:59 UTC
Mesa 10.3.0 (git master), latest commit is gf7bf37c
Comment 15 Chris Wilson 2014-05-21 10:11:14 UTC
Well, that's one thing. I'm using a f20 with mesa-10.1.3 atm.
Comment 16 Chris Wilson 2014-05-21 10:15:39 UTC
The "use Xorg triple buffering" patch should be easily disabled by

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 2fe151b..cd1ee7b 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -55,7 +55,7 @@ USE OR OTHER DEALINGS IN THE SOFTWARE.
 #error DRI2 version supported by the Xserver is too old
 #endif
 
-#if DRI2INFOREC_VERSION < 6
+#if DRI2INFOREC_VERSION < 6 || 1
 #define XORG_CAN_TRIPLE_BUFFER 0
 #else
 #define XORG_CAN_TRIPLE_BUFFER 1


which would be a useful confirmation.
Comment 17 Chris Wilson 2014-05-21 11:04:49 UTC
commit bf4475d29de589a0d0e71dfb895d19a635f8cc3e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 21 11:31:29 2014 +0100

    sna/dri2: Tidy conditional use of XORG_CAN_TRIPLE_BUFFER

should not have had any functional impact, but please check. ;)
Comment 18 Armin K 2014-05-21 17:26:17 UTC
(In reply to comment #16)
> The "use Xorg triple buffering" patch should be easily disabled by
> 
> diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
> index 2fe151b..cd1ee7b 100644
> --- a/src/sna/sna_dri2.c
> +++ b/src/sna/sna_dri2.c
> @@ -55,7 +55,7 @@ USE OR OTHER DEALINGS IN THE SOFTWARE.
>  #error DRI2 version supported by the Xserver is too old
>  #endif
>  
> -#if DRI2INFOREC_VERSION < 6
> +#if DRI2INFOREC_VERSION < 6 || 1
>  #define XORG_CAN_TRIPLE_BUFFER 0
>  #else
>  #define XORG_CAN_TRIPLE_BUFFER 1
> 
> 
> which would be a useful confirmation.

Applying the patch makes the error go away.
Comment 19 Armin K 2014-05-21 17:26:42 UTC
(In reply to comment #17)
> commit bf4475d29de589a0d0e71dfb895d19a635f8cc3e
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed May 21 11:31:29 2014 +0100
> 
>     sna/dri2: Tidy conditional use of XORG_CAN_TRIPLE_BUFFER
> 
> should not have had any functional impact, but please check. ;)

Still the same. I even upgraded Mesa to latest git master.
Comment 20 Chris Wilson 2014-05-21 20:09:29 UTC
Next step:

commit 66e14c96d0ddede90e36084a80a97c3f16c2b386
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 21 18:38:36 2014 +0100

    sna/dri2: Limit pending swaps to 1 when queueing a future blit
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 21 Chris Wilson 2014-05-22 07:42:45 UTC
Another day, another bunch of DRI2 review:

commit 80dfbaa3c39e1c56aaa9f8951e3bbcc30a9b748d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 22 00:24:07 2014 +0100

    sna/dri2: Move fixed array allocations to per-crtc
Comment 22 Armin K 2014-05-22 08:57:11 UTC
Regression still present in git master. Tested few minutes ago.
Comment 23 Chris Wilson 2014-05-22 09:05:53 UTC
Can you apply

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index ed2e561..52f283d 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -51,6 +51,9 @@ USE OR OTHER DEALINGS IN THE SOFTWARE.
 #include <compositeext.h>
 #endif
 
+#undef DBG
+#define DBG(x) ErrorF x
+
 #if DRI2INFOREC_VERSION < 2
 #error DRI2 version supported by the Xserver is too old
 #endif

reproduce the bug and attach the log?

Yes, you will need

commit 0f0cd87c66a4e373ab9b26626514d43b5f4978e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 22 10:04:09 2014 +0100

    sna/dri2: DBG fixes

;-)
Comment 24 Armin K 2014-05-22 09:28:48 UTC
Created attachment 99573 [details]
Screenshot of the "ghost" konsole

This happened right after I closed konsole program. Program itself was closed but the "ghost" image remained there until I moved the mouse.
Comment 25 Armin K 2014-05-22 09:30:12 UTC
Created attachment 99574 [details]
Screenshot of the corrupted KDE "start menu"

As you can see, half of the menu is transparent while some parts I moved mouse over are as they should be.
Comment 26 Armin K 2014-05-22 09:31:21 UTC
Created attachment 99575 [details]
Another Screenshot of the corrupted KDE "start menu"

Now with more icons available as the mouse was moved over them
Comment 27 Armin K 2014-05-22 09:32:36 UTC
Created attachment 99576 [details]
Xorg log with the patch from Comment 23

This is the log of the session, same one that I managed to take 3 screenshots.
Comment 28 Chris Wilson 2014-05-22 11:18:10 UTC
Maybe?

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index ed2e561..e9a89ab 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -2018,6 +2018,7 @@ sna_dri2_schedule_flip(ClientPtr client, DrawablePtr draw, xf86CrtcPtr crtc,
                                type = FLIP_THROTTLE;
                                info->mode = -type;
                                current_msc++;
+                               swap_limit(draw, 1);
                                goto out;
                        }
                }
Comment 29 Armin K 2014-05-22 11:26:26 UTC
Created attachment 99577 [details]
Xorg log with the patch from Comment 23 and Comment 28

I'm afraid it doesn't change anything.
Comment 30 Chris Wilson 2014-05-22 12:17:10 UTC
Time to start narrowing down the affected paths:

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 85492b5..6cbda1d 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -2007,7 +2007,7 @@ sna_dri2_schedule_flip(ClientPtr client, DrawablePtr draw, xf86CrtcPtr crtc,
                                info->back = back;
                                sna_dri2_reference_buffer(back);
                        }
-                       if (current_msc >= *target_msc) {
+                       if (current_msc >= *target_msc && 0) {
                                DBG(("%s: executing xchg of pending flip\n",
                                     __FUNCTION__));
Comment 31 Armin K 2014-05-22 13:09:27 UTC
Issue is still there I'm afraid.
Comment 32 Chris Wilson 2014-05-22 13:13:45 UTC
Hmm. How about?

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 85492b5..597bc1d 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -1999,7 +1999,7 @@ sna_dri2_schedule_flip(ClientPtr client, DrawablePtr draw, xf86CrtcPtr crtc,
                DBG(("%s: performing immediate swap on pipe %d, pending? %d, mode: %d\n",
                     __FUNCTION__, pipe, info != NULL, info ? info->mode : 0));
 
-               if (info && info->draw == draw) {
+               if (info && info->draw == draw && 0) {
                        assert(info->type != FLIP);
                        assert(info->front == front);
Comment 33 Armin K 2014-05-22 14:56:21 UTC
Still the same.
Comment 34 Chris Wilson 2014-05-22 19:24:31 UTC
Back to the hunt:

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 85492b5..4c481ab 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -1992,7 +1992,7 @@ sna_dri2_schedule_flip(ClientPtr client, DrawablePtr draw, xf86CrtcPtr crtc,
        union drm_wait_vblank vbl;
        CARD64 current_msc;
 
-       if (immediate_swap(sna, *target_msc, divisor, crtc, &current_msc)) {
+       if (immediate_swap(sna, *target_msc, divisor, crtc, &current_msc) && 0) {
                int type;
 
                info = sna->dri2.flip_pending;
Comment 35 Armin K 2014-05-22 19:38:25 UTC
Seems to work fine.
Comment 36 Armin K 2014-05-26 13:31:13 UTC
Issue still present in latest git master though.
Comment 37 Chris Wilson 2014-05-26 13:34:21 UTC
I notice that mesa has a couple of similar-ish regression bugs from 10.2, any chance you can sanity check with older mesa?

I've been without internet for a few days, but I do have some more tweaks under way...
Comment 38 Armin K 2014-05-26 16:11:01 UTC
Okay, some progress has been made.

I have tested Mesa-10.0.5, Mesa-10.1.4 and Mesa-10.3.0-devel (git-1472584), all 3 have same problem with xf86-video-intel git master.

The culprit here seems to be kwin itself. When I disable kwin effects (alt+shift+f12), all issues go away.
Comment 39 Armin K 2014-05-26 16:19:19 UTC
Also, I have disabled discrete gpu by using radeon.disable=1 on kernel cmd line and that did not fix the problem which confirms that prime setup isn't something that causes problems here.

Also, forgot to note:

$ kwin --version
Qt: 4.8.6
KDE Development Platform: 4.13.1
KWin: 4.11.9
Comment 40 Chris Wilson 2014-05-26 19:49:15 UTC
For fun:

commit 2a0dc7ed4961d5bda08bb583a7a6a2cc633b27e4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon May 26 08:23:45 2014 +0100

    sna/dri2: Move scanout processing from frame event to global
    
    The scanout is a global property, track it as so. The primary benefit to
    this is it strengthens our assertions that we never hand out an active
    scanout for use as a back buffer.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 05f149285b429c7b3f25ac8049477230aaef512c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun May 25 07:14:53 2014 +0100

    sna/dri2: Defer reallocation of backbuffer until request
    
    If we defer the reallocation of the backbuffer until the client requests
    the set of current buffers (with GetBuffers), then we can often avoid
    allocating the fresh backbuffer as the flip often retires before the
    client is ready.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Can I ask if you can also compile with --enable-debug (if you haven't already)?
Comment 41 Armin K 2014-05-27 08:59:42 UTC
Problem is still here. I did however compile with --enable-debug. What am I exactly looking for? There's nothing new in Xorg.0.log (I didn't build with patch from Comment 23 this time).
Comment 42 Chris Wilson 2014-05-27 09:27:19 UTC
There will be an additional line when SNA starts confirming that asserts are enabled. Otherwise there will be no change except for when an assert is hit and X dies.
Comment 43 Chris Wilson 2014-05-28 10:26:14 UTC
I still haven't hit this myself yet. With the GetBuffers() assert that the back buffer we hand back is not on the scanout, I can be reasonably certain that the error is not that.

Two things can I ask you to double check:

1. That 

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index ac73680..daf1d08 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -211,7 +211,7 @@ sna_dri2_window_get_chain(WindowPtr win)
 #error DRI2 version supported by the Xserver is too old
 #endif
 
-#if DRI2INFOREC_VERSION < 6
+#if DRI2INFOREC_VERSION < 6 || 1
 #define XORG_CAN_TRIPLE_BUFFER 0
 #define swap_limit(d, l) false
 #else

still restores the old behaviour, and

2. whether

diff --git a/src/sna/sna_dri2.c b/src/sna/sna_dri2.c
index 78c4069..ac73680 100644
--- a/src/sna/sna_dri2.c
+++ b/src/sna/sna_dri2.c
@@ -898,6 +898,8 @@ __sna_dri2_copy_region(struct sna *sna, DrawablePtr draw, RegionPtr region,
                if (priv)
                        src_bo = priv->gpu_bo;
        }
+       if (src_bo != src_priv->bo)
+               ErrorF("applying stale buffer fixup to Window source!\n");
 
        dst_bo = dst_priv->bo;
        if (dst->attachment == DRI2BufferFrontLeft) {
@@ -916,6 +918,9 @@ __sna_dri2_copy_region(struct sna *sna, DrawablePtr draw, RegionPtr region,
        } else
                sync = false;
 
+       if (dst_bo != dst_priv->bo)
+               ErrorF("applying stale buffer fixup to Window destination!\n");
+
        if (!wedged(sna)) {
                xf86CrtcPtr crtc;
 

issues any warnings when you see the errors?
Comment 44 Armin K 2014-05-28 11:00:05 UTC
I haven't yet hit any assert. as Xserver hasn't crashed.

I did however suspect that gcc-4.9 might be causing this as it does cause weird behaviour for some other packages. I tried building with clang-3.4.1 but that caused intel_drv.so to segfault, preventing xserver to start - but that's worth another bug report.

1. from last comment:

The patch indeed fixes the behaviour so I can once again confirm that the original information is correct and that triple buffering patch causes this.

I'll try 2. soon and report back.
Comment 45 Armin K 2014-05-28 11:07:32 UTC
Patch from Comment 43, 2. didn't add anything new into Xorg.0.log but I didn't apply the debug patch from Comment 23. Should I apply that one too?
Comment 46 Chris Wilson 2014-05-28 11:10:18 UTC
The pair of ErrorFs in comment 43 will work by themselves.
Comment 47 Armin K 2014-05-28 11:20:02 UTC
Nothing happened with that patch. Since I'm unable to build with clang, I'll try to build the driver with gcc-4.8.3 and report back to see if it's gcc-4.9.0 that's causing the issue.
Comment 48 Armin K 2014-05-28 11:48:10 UTC
Building with gcc-4.8.3 didn't fix the problem, still the same. That irons out gcc-4.9.0.
Comment 49 Armin K 2014-05-28 12:16:33 UTC
Good news. I just upgraded to kernel 3.15.0-rc7 and problem went away. It seems that something is wrong in kernel 3.14.4 which I was using.
Comment 50 Armin K 2014-05-29 12:23:39 UTC
No issues in last 24 hours with 3.15.0-rc7 kernel.
Comment 51 Chris Wilson 2014-05-30 11:18:51 UTC
I don't have an explanation for how this broke in your 3.14 kernel and then was fixed in 3.15, but I am happy that it was...

If you do get really, really bored, you can try bisecting and see if we can come up with an explanation. But for the time being, let's move on.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.