Bug 70764

Summary:

[snb] TearFree now triggers stuck semaphores (due to Eliminate the synchronous wait from inside TearFree)

Product:

xorg

Reporter:

FBrown <francisbrwn9>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

low

CC:

inbox, samuel

Version:

git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
crash log on drm-intel-nightly	none

Description FBrown 2013-10-22 14:38:45 UTC

Bisected to latest commit: d788b69fdb5ea73b1d283a89e53b2b19eaa90e6e

sna: Enable TearFree rendering for transformed scanouts

On a:
Ubuntu linux saucy
Kernel 3.11.0-12-generic (3.11.3?)
xorg 1.14.3 
SNB: HD3000 mobile

with tearfree enabled by xorg.conf

have X freezing for several seconds at a time and a corresponding error in kernel log of

[drm:ring_stuck] *ERROR* Kicking stuck semaphore on render ring

occasionally appended by 

[drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x575a000 ctx 0) at 0x575aa64

Comment 1 FBrown 2013-10-22 15:06:37 UTC

My error. Problematic commit actually seems to be the previous one.

sna: Eliminate the synchronous wait from inside TearFree

Comment 2 Chris Wilson 2013-10-22 15:13:16 UTC

The stuck semaphore is an old, old hw issue. Are you using TearFree?

Comment 3 FBrown 2013-10-22 15:15:54 UTC

(In reply to comment #2)
> The stuck semaphore is an old, old hw issue. Are you using TearFree?

Yes, as get tearing on video without it.

Comment 4 Chris Wilson 2013-10-22 15:21:59 UTC

See #54226 for the stuck semaphore issue. I will have a think about why the update made it easier to hit.

Comment 5 FBrown 2013-10-22 15:31:11 UTC

Thanks. To the best of my knowledge I've not "hit" that error on this hardware before through multiple distribution and major kernel version upgrades. May be an old problem per se, but new with that commit in this case.

Comment 6 Chris Wilson 2013-10-22 15:36:14 UTC

In recent kernels (by which I mean drm-intel-nightly), hangcheck will capture the error upon the first struck semaphore. That /sys/class/drm/card0/error would be useful to see when the error occurs. (I'm wondering if this something funky with the pageflip for instance.)

Comment 7 FBrown 2013-10-22 15:42:19 UTC

(In reply to comment #6)
> In recent kernels (by which I mean drm-intel-nightly), hangcheck will
> capture the error upon the first struck semaphore. That
> /sys/class/drm/card0/error would be useful to see when the error occurs.
> (I'm wondering if this something funky with the pageflip for instance.)

I'll grab a build from here 

http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/current/ 

and try that at some point later today then.

Comment 8 FBrown 2013-10-22 16:04:52 UTC

Created attachment 88007 [details]
crash log on drm-intel-nightly

OK. Following.

[drm] stuck on render ring
[  610.496014] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  610.496015] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  610.496015] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  610.496016] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  610.496016] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.

crash dump is attached

Comment 9 Chris Wilson 2013-10-22 16:22:56 UTC

Doesn't seem to be an extraordinary hang, so it would seem to be something peculiar to the timing of the new rendering.

Comment 10 Chris Wilson 2013-10-22 20:05:57 UTC

Not intentionally targetting this bug, but since this should affect timing subtly might as well see if it has any impact here:

commit 247ebf86a32b518536a11c2f866b2c885bd55d38
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 22 07:58:03 2013 +0100

    sna: Only force the TearFree exchange before a write
    
    We can read from the current scanout buffer without incurring a visual
    tear, so delay the GPU bo exchange until we see a write to the object.

Comment 11 FBrown 2013-10-22 21:08:33 UTC

Recompiled up to latest git and seems to significantly better, although not perfect.

Whereas before it would hang roughly once every 3-4 mins, now only had a couple in 30-40 mins.

Comment 12 FBrown 2013-10-23 10:49:14 UTC

Is there any more info I can give to help with this?

If not, then for now I would prefer to revert back to the last non buggy commit or stable snapshot.

Comment 13 Chris Wilson 2013-10-23 11:52:55 UTC

One last thing you can test is:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fcade6..1ae9ec0 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -4303,12 +4303,14 @@ static bool wait_for_shadow(struct sna *sna, struct sna_pixmap *priv, unsigned f
                sna_mode_wakeup(sna);
 
        if (sna->mode.shadow_flip) {
-               bo = kgem_create_2d(&sna->kgem,
-                                   pixmap->drawable.width,
-                                   pixmap->drawable.height,
-                                   pixmap->drawable.bitsPerPixel,
-                                   priv->gpu_bo->tiling,
-                                   CREATE_EXACT | CREATE_SCANOUT);
+               bo = NULL;
+               if (0)
+                       bo = kgem_create_2d(&sna->kgem,
+                                           pixmap->drawable.width,
+                                           pixmap->drawable.height,
+                                           pixmap->drawable.bitsPerPixel,
+                                           priv->gpu_bo->tiling,
+                                           CREATE_EXACT | CREATE_SCANOUT);
                if (bo != NULL) {
                        DBG(("%s: replacing still-attached GPU bo\n",
                             __FUNCTION__));

applied to 

commit b70390b4f824afeed8f1b5a9baf79e33377405cd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 23 11:15:28 2013 +0100

    sna: Reset bo after allocation failure during wait-for-shadow

Comment 14 FBrown 2013-10-23 12:10:33 UTC

OK. When I see that commit in the repo I'll give the patch at try.

Comment 15 Chris Wilson 2013-10-23 12:12:31 UTC

Oops, should be there now.

Comment 16 FBrown 2013-10-23 12:14:16 UTC

It is. I'll pull the changes in, apply that patch and test. Thanks.

Comment 17 FBrown 2013-10-23 12:30:33 UTC

Patch does not seem to apply on top of that commit? Could just be me, as I don't do these things manually very often, but can you re-check?

Comment 18 Chris Wilson 2013-10-23 12:42:33 UTC

Ok, try this simpler one instead:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fcade6..5c04d78 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -4299,7 +4299,7 @@ static bool wait_for_shadow(struct sna *sna, struct sna_pixmap *priv, unsigned f
        damage = sna->mode.shadow_damage;
        sna->mode.shadow_damage = NULL;
 
-       while (sna->mode.shadow_flip && sna_mode_has_pending_events(sna))
+       while (sna->mode.shadow_flip && sna_mode_wait_for_event(sna))
                sna_mode_wakeup(sna);
 
        if (sna->mode.shadow_flip) {

Comment 19 FBrown 2013-10-23 13:30:33 UTC

Have both patchs applying now. Can I just check whether you want both applied to test, or just one of the 2?

Comment 20 Chris Wilson 2013-10-23 13:33:37 UTC

They both have the same effect, and can be used combined or individually.

Comment 21 FBrown 2013-10-24 12:14:55 UTC

Definitely a marked improvement with those patches. Not perfect, as had one hang on DE login and a couple of others over the last day, but still a very significant reduction making. So not fixed, but you would have to say the desktop is now usable if you accept the very rare hang.

As a note, now reverted and turned off tearfree via xorg.conf, as I noticed KWIN has some newer anti-tearing options that I had not tried before. Tearing prevention (vsync) set at "full screen repaints" in KWIN seems to achieve much the same result as tearfree via xorg/driver option, which seems to make sense to me.

Comment 22 Chris Wilson 2013-10-24 12:35:20 UTC

Yes, using the tear-free mode of the compositor (if you are using a compositor) is the preferrable solution. The driver option is really for only those people who insist on not using a compositor, but still demand tearfree, or if they require rotated outputs (which cannot yet be done with xrandr & compositing managers).

Thanks for the feedback, I shall try and work out why that particular copy is more prone to triggering the bad semaphore update.

Comment 23 FBrown 2013-10-24 18:59:18 UTC

Yes, but unfortunately prior to the latest KWIN updates I had tearing under KWIN no matter what options were selected. Hence needing it enabled by the driver, which granted was not ideal. KWIN in KDE 4.11 introduced more options for tearing prevention, rather than a just a toggle on/off for vsync, and one of those options seems to work nicely with this hardware allowing me to dispense with the driver tearfree option (I hope).

Thanks for the assistance, and hopefully has been useful regardless.

Comment 24 Chris Wilson 2014-04-30 11:04:32 UTC

I've been tweaking TearFree (in xf86-video-intel.git) to avoid more copies, any chance you can see if that helps?

Comment 25 FBrown 2014-04-30 13:03:21 UTC

Have been using KDE's own tear free compositing options since this bug, but if I can set things back to how they originally were using the xorg.conf tear free instead, then I can have a test sometime in the next week. With the caveat of course that my system, kernel, org, etc etc have gone through at least one version upgrade since I originally reported.

Comment 26 Chris Wilson 2014-10-07 19:36:25 UTC

*** Bug 84755 has been marked as a duplicate of this bug. ***

Comment 27 Chris Wilson 2015-05-23 22:16:38 UTC

I believe with

commit b47161858ba13c9c7e03333132230d66e008dd55
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Apr 27 13:41:17 2015 +0100

    drm/i915: Implement inter-engine read-read optimisations

the problem is reduce to the normal semaphores bug 54226 (i.e. frequency should be reduced). So closing this instance.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.