Bug 70905

Summary: display turns off with intel(0): sna_mode_redisplay: page flipping failed, disabling CRTC:3 (pipe=0)
Product: xorg Reporter: Michael Stapelberg <michael+freedesktop>
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: hakakaha, joe, martin.monperrus
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 73856    
Attachments:
Description                                                       Flags
entire Xorg.0.log                                                 none
(new) entire xorg.log                                             none
Xorg.0.log                                                        none
dmesg with drm.debug=6                                            none
dmesg with drm.debug=6 from current drivers                       none
stdout/stderr from startx during issue                            none
Xorg.0.log --enable-debug-full damage box is beyond the pixmap    none
damage box is beyond the drawable 2.99.907-38-g4c7b183            none

Description Michael Stapelberg 2013-10-27 08:34:12 UTC
Created attachment 88168 [details]
entire Xorg.0.log

I just upgraded to the intel driver in version 2.99.905 in order to try the improved TearFree option. It seems to work fine (yay, thanks!), but a couple minutes later my display just turned off. Switching to vt1 and back to vt7 was enough to get it back on, but I suspect this will happen more often, so I figured I’d report it.

The last few lines in /var/log/Xorg.0.log are:

[2450513.409] (II) intel(0): switch to mode 2560x1440@60.0 on pipe 0 using DP2, position (0, 0), rotation normal
[2451327.200] (EE) intel(0): sna_mode_redisplay: page flipping failed, disabling CRTC:3 (pipe=0)
[2451340.647] (II) AIGLX: Suspending AIGLX clients for VT switch
[2451346.532] (II) Open ACPI successful (/var/run/acpid.socket)
[2451346.532] (II) AIGLX: Resuming AIGLX clients after VT switch
[2451346.532] (II) intel(0): switch to mode 2560x1440@60.0 on pipe 0 using DP2, position (0, 0), rotation normal

Full Xorg.0.log attached. Please let me know if you need any other information. I am using the integrated GPU of the following processor:

cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
stepping	: 7
Comment 1 Chris Wilson 2013-10-27 15:35:03 UTC
This should prevent the display turning off after the odd failure:

commit 4e0a01a7a3cf0f473c7ffae9129069086bf2fbe2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Oct 27 15:33:13 2013 +0000

    sna: Handle transient TearFree flip failures
    
    If we get a pageflip fail when trying to do a TearFree update, just
    fallback to a copy (before turning off the display for complete
    failure). The rare tearing copy should mar the user experience far less.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
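
The fallback order described in this commit message can be sketched roughly as follows. This is a minimal model under stated assumptions, not the driver's code: `tearfree_redisplay`, `struct crtc_ops`, and the `stub_*` helpers are hypothetical names invented for illustration, and the real logic lives in `sna_mode_redisplay()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of one TearFree redisplay attempt in this simplified model. */
enum redisplay_result {
    REDISPLAY_FLIPPED,   /* page flip queued, no tearing */
    REDISPLAY_COPIED,    /* flip failed, fell back to a copy that may tear */
    REDISPLAY_DISABLED   /* flip and copy both failed, output turned off */
};

/* Hypothetical backend hooks; the first two return true on success. */
struct crtc_ops {
    bool (*page_flip)(void *crtc);
    bool (*copy_to_front)(void *crtc);
    void (*disable)(void *crtc);
};

enum redisplay_result tearfree_redisplay(const struct crtc_ops *ops, void *crtc)
{
    assert(ops && ops->page_flip && ops->copy_to_front && ops->disable);

    if (ops->page_flip(crtc))
        return REDISPLAY_FLIPPED;

    /* Transient flip failure: accept one possibly torn frame instead of
     * turning the display off. */
    if (ops->copy_to_front(crtc))
        return REDISPLAY_COPIED;

    /* Only a complete failure disables the CRTC. */
    ops->disable(crtc);
    return REDISPLAY_DISABLED;
}

/* Stand-in hooks for experimenting; crtc points at an int counter here. */
bool stub_fail(void *crtc) { (void)crtc; return false; }
bool stub_ok(void *crtc) { (void)crtc; return true; }
void stub_disable(void *crtc) { ++*(int *)crtc; }
```

With `page_flip` set to `stub_fail` and `copy_to_front` set to `stub_ok`, the function reports `REDISPLAY_COPIED` and never calls `disable`, which is the behaviour the commit aims for on a transient failure.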
Comment 2 Chris Wilson 2013-11-01 11:56:07 UTC
I believe the workaround should be effective and hide the issue.
Comment 3 Michael Stapelberg 2013-11-09 12:23:33 UTC
(In reply to comment #2)
> I believe the workaround should be effective and hide the issue.
Thanks for the quick fix! I only now had a chance to upgrade my version of the driver, and so far I haven’t had the issue. Will report back in the unlikely event that it still happens.
Comment 4 Michael Stapelberg 2013-11-09 14:11:09 UTC
Created attachment 88930 [details]
(new) entire xorg.log
Comment 5 Michael Stapelberg 2013-11-09 14:11:27 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > I believe the workaround should be effective and hide the issue.
> Thanks for the quick fix! I only now had a chance to upgrade my version of
> the driver, and so far I haven’t had the issue. Will report back in the
> unlikely event that it still happens.
It just happened again:

[3146197.461] (EE) intel(0): sna_mode_redisplay: page flipping failed, disabling CRTC:3 (pipe=0)
[3146199.888] (II) AIGLX: Suspending AIGLX clients for VT switch
[3146200.700] (II) Open ACPI successful (/var/run/acpid.socket)
[3146200.700] (II) AIGLX: Resuming AIGLX clients after VT switch
[3146200.700] (II) intel(0): switch to mode 2560x1440@60.0 on pipe 0 using DP2, position (0, 0), rotation normal
[3146201.213] setversion 1.4 failed

This is with commit abf1a16914d993cc150005879375d4bb17fdccf3.

New full Xorg.0.log attached.
Comment 6 Chris Wilson 2013-11-09 17:04:33 UTC
That's quite an old version of xf86-video-intel: 2.21.15. Certainly doesn't contain the recent fix.
Comment 7 Michael Stapelberg 2013-11-09 20:33:42 UTC
(In reply to comment #6)
> That's quite an old version of xf86-video-intel: 2.21.15. Certainly doesn't
> contain the recent fix.
I built it from git. The commit ID I gave you was accurate. Ignore the version it reports; I build it as a Debian package without updating the version number.
Comment 8 Chris Wilson 2013-11-10 10:51:02 UTC
Ah, this looks to be a different class of fail. Do you have the dmesg for that run? Could you reproduce this with drm.debug=6 (or echo 6 > /sys/module/drm/parameters/debug) and grab the dmesg?
Comment 9 Michael Stapelberg 2013-11-11 07:20:53 UTC
(In reply to comment #8)
> Ah, this looks to be a different class of fail. Do you have the dmesg for
> that run? Could you reproduce this with drm.debug=6 (or echo 6 >
> /sys/module/drm/parameters/debug) and grab the dmesg?
I looked into /var/log/syslog, but don’t see any dmesg output for that time period.

I can try to reproduce it, but not before the weekend.
Comment 10 Janusz 2013-11-15 23:48:48 UTC
Created attachment 89295 [details]
Xorg.0.log
Comment 11 Janusz 2013-11-15 23:52:50 UTC
Created attachment 89297 [details]
dmesg with drm.debug=6
Comment 12 Janusz 2013-11-15 23:53:45 UTC
Working on xf86-video-intel-2.21.15 here, also experiencing screen blackouts. I've attached my Xorg.log and dmesg after switching drm.debug=6.
I will try to build newer drivers and test them tomorrow. Triggering the bug is easy for me, using gMaps in a fullscreen browser window.
Comment 13 Janusz 2013-11-16 11:10:42 UTC
Created attachment 89313 [details]
dmesg with drm.debug=6 from current drivers

Updated to current from git. No error shows up in Xorg.log now, but the blackout still occurs.
During boot-up, there is:
[drm:intel_pipe_config_compare] *ERROR* mismatch in gmch_pfit.lvds_border_bits (expected 32768, found 0)
I'm not sure now if it's the same issue.
Comment 14 Janusz 2013-11-21 12:59:03 UTC
I'm now working on the last tagged version from git - 2.99.906. Seems a little more stable than .905, at least on desktop G45. On mobile GM45 I have some strange flickering with old frames showing. On both machines there are still some occasional Xorg crashes.
So no TearFree for me. I was only testing it because of tearing while decoding with vaapi/intel-driver g45-h264 branch so I'm still impressed.
Comment 15 Chris Wilson 2014-01-09 13:35:35 UTC
A couple more bugs with TearFree were fixed, but I'm interested in getting a --enable-debug=full Xorg.0.log that ends with the page flip fail.
Comment 16 Chris Wilson 2014-01-17 18:40:40 UTC
*** Bug 71585 has been marked as a duplicate of this bug. ***
Comment 17 Joe Peterson 2014-01-18 18:12:04 UTC
Chris, I am able to reproduce this reliably by enabling TearFree, launching the Spotify application, and searching for something in Spotify.  Almost immediately (after I see some changes in Spotify's window), the screen blanks, and I then see text from the console run of startx.  The console is hung at that point.

I pulled the latest git and compiled with --enable-debug=full.  The Xorg.0.log is too large for bugzilla, so here's the link:

http://data.boulder.swri.edu/~joe/Xorg.0.log

Let me know after you have grabbed this so I can remove it - hope it's helpful.
Comment 18 Chris Wilson 2014-01-18 18:22:44 UTC
Hmm, grabbed your Xorg.0.log, but I don't see the failure message and subsequent disabling of the output.

Just to be sure, these are the last lines I have:

[   612.221] sna_pixmap_move_to_gpu(pixmap=4, usage=16), flags=a
[   612.221] wait_for_shadow: flags=a, shadow_flip=0, handle=31, wait=58, old=58
[   612.221] sna_pixmap_move_to_gpu: already all-damaged
[   612.221] sna_pixmap_mark_active: pixmap=4, handle=31
Comment 19 Joe Peterson 2014-01-18 18:29:12 UTC
(In reply to comment #18)

Chris, true, it ends that way - not sure why, as it does blank/crash (maybe it did not have a chance to fully flush?).  I will now build 2.99.907 with full debug and see if I can get it to produce a log file with the message - maybe that will help...  stand by.
Comment 20 Joe Peterson 2014-01-18 19:22:59 UTC
(In reply to comment #19)

OK, Chris, I've done 4 more tests, using 2.99.907 and the latest git.  All end up crashing, but I noticed that, like before, the full debug logs seem not to have the page flipping fail msg, whereas the no debug logs do.  Also, when I make it crash with no debug, I get to a blank screen, and then I can go to another console and log back in.  Whereas if full debug is enabled, I see blank, then the last screen from the console in which I ran startx, and I cannot go to another console (need to shut down and restart).  Here are the files; let me know when you get them:

http://data.boulder.swri.edu/~joe/Xorg.0.log_2.99.907_full_debug
http://data.boulder.swri.edu/~joe/Xorg.0.log_2.99.907_no_debug
http://data.boulder.swri.edu/~joe/Xorg.0.log_git_full_debug
http://data.boulder.swri.edu/~joe/Xorg.1.log_git_no_debug

Any clue as to why I am seeing this difference with full debug on?  Let me know if I can try anything else.
Comment 21 Chris Wilson 2014-01-18 21:28:00 UTC
(In reply to comment #20)
> Any clue as to why I am seeing this difference with full debug on?  Let me
> know if I can try anything else.

Full-debug will also turn on a number of assertions which it may then be tripping over. Unfortunately these are only reported on stderr. Do you have that captured to any logfile (e.g. xdm.log or gdm.log)?
Comment 22 Joe Peterson 2014-01-19 00:49:35 UTC
Created attachment 92365 [details]
stdout/stderr from startx during issue

Chris, attached is the output from startx.  Hope this is what you were referring to.  It does show an assertion.

I uploaded the associated/new Xorg.0.log to the same location as before:

http://data.boulder.swri.edu/~joe/Xorg.0.log_git_full_debug
Comment 23 Chris Wilson 2014-01-19 12:52:16 UTC
Ok, but I don't think that assert explains the pageflip failure though. :(

This should fix the assert:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 10ee9e4..000d9ab 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -4530,7 +4530,17 @@ void sna_mode_redisplay(struct sna *sna)
                return;
        }
 
-       assert(sna_pixmap(sna->front)->move_to_gpu == NULL);
+       {
+               struct sna_pixmap *priv;
+
+               priv = sna_pixmap(sna->front);
+               assert(priv != NULL);
+
+               if (priv->move_to_gpu)
+                       (void)priv->move_to_gpu(sna, priv, 0);
+
+               assert(priv->move_to_gpu == NULL);
+       }
 
        for (i = 0; i < config->num_crtc; i++) {
                xf86CrtcPtr crtc = config->crtc[i];
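
The shape of this change (run any still-pending deferred operation first, then assert that nothing is pending) can be shown in isolation. The sketch below is a simplified model, not driver code: `struct pixmap_priv`, `flush_deferred`, and `finish_pending_copy` are hypothetical names, and the callback signature is reduced to one argument, whereas the diff above passes `(sna, priv, 0)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pixmap_priv;

/* A deferred operation; returns true once it has completed. */
typedef bool (*deferred_fn)(struct pixmap_priv *priv);

struct pixmap_priv {
    deferred_fn move_to_gpu;  /* non-NULL while work is still pending */
    int flushes;              /* how many times pending work was run */
};

/* Example deferred operation: does its work, then clears the hook. */
bool finish_pending_copy(struct pixmap_priv *priv)
{
    priv->flushes++;
    priv->move_to_gpu = NULL;
    return true;
}

/* Run any pending operation so that callers may rely on
 * move_to_gpu == NULL afterwards. */
void flush_deferred(struct pixmap_priv *priv)
{
    assert(priv != NULL);

    if (priv->move_to_gpu)
        (void)priv->move_to_gpu(priv);

    /* The invariant is now established by this function instead of
     * being assumed on entry. */
    assert(priv->move_to_gpu == NULL);
}
```

Calling `flush_deferred()` twice in a row runs the pending work only once, because the hook clears itself on completion.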
Comment 24 Joe Peterson 2014-01-19 15:17:17 UTC
(In reply to comment #23)

Is there a debug level or any other debug tooling I can use that will not trigger the asserts, but will give you more info on the failure?  I can reproduce it instantly every time with my setup, so this is a good opportunity to find the problem if you can think of a way I can get you more info.
Comment 25 Janusz 2014-01-20 12:52:51 UTC
I was trying to trigger page flip error with 2.99.907-31-gde0797e --enable-debug=full build with no effect. Tried Spotify client as Joe suggested (running in wine, right?), got X Server termination with:

(EE) sna_copy_boxes: damage box is beyond the pixmap: box=(1, 1), (32, 33), pixmap=(32, 32)

I thought it was a different issue, so just to be sure I tried with TearFree disabled - it gives me the same error. So is an Xorg.log with the current config and some lines before the crash needed?
Comment 26 Chris Wilson 2014-01-20 12:56:52 UTC
(In reply to comment #25)
> I was trying to trigger page flip error with 2.99.907-31-gde0797e
> --enable-debug=full build with no effect. Tried Spotify client as Joe
> suggested (running in wine, right?), got X Server termination with:
> 
> (EE) sna_copy_boxes: damage box is beyond the pixmap: box=(1, 1), (32, 33),
> pixmap=(32, 32)

Do you still have the Xorg.0.log? That's a different (and very unusual) bug that needs to be fixed.
Comment 27 Janusz 2014-01-20 13:44:38 UTC
Created attachment 92450 [details]
Xorg.0.log --enable-debug-full damage box is beyond the pixmap

Log with "damage box is beyond the pixmap", about 1000 lines from the top and down. I hope it's usefully cropped.
Comment 28 Chris Wilson 2014-01-20 14:38:48 UTC
Thanks, that is most peculiar. For an inexplicable reason the clip-to-dst-clip didn't seem to be sufficient, so I added some more debug output. Then I noticed that it should be a fallback anyway, so provided a shorter fallback path (and hopefully beefed up the debugging to catch the error along that path as well).

Long story short, please try to capture a fresh debug log with xf86-video-intel.git:

commit 4c7b183fd21b461f9f18662c3b9d9732b6bef13d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 20 14:35:55 2014 +0000

    sna: Short-cut the fallback for XCopyArea with depth < 8
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 671658499bf432666a96b31ac96d2c66e2168c4c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 20 14:35:35 2014 +0000

    sna: Add some more DBG output around the clipping in sna_do_copy()
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 29 Janusz 2014-01-20 17:17:21 UTC
Created attachment 92469 [details]
damage box is beyond the drawable 2.99.907-38-g4c7b183

About it being a different bug - I'm really new to bugzilla, should it be moved to a new bug report?
Comment 30 Joe Peterson 2014-01-20 17:19:58 UTC
Interesting about the different problem triggered by Spotify.  I still get the page flip issue using Spotify - could these two problems be related?  I wish I could provide more meaningful debug info on my case...
Comment 31 Chris Wilson 2014-01-20 20:25:48 UTC
(In reply to comment #29)
> Created attachment 92469 [details]
> damage box is beyond the drawable 2.99.907-38-g4c7b183
> 
> About it being a different bug - I'm really new to bugzilla, should it be
> moved to a new bug report?

Yes, I was hoping it would be trivial enough to fix immediately, but even then a separate bug is much better for tracking in changelogs and NEWS.
Comment 32 Chris Wilson 2014-01-20 21:38:40 UTC
What's happening in this bug digression is that we miss recomputing the gc->pCompositeClip. Strange, very strange. Again, I have added some extra debugging to narrow down why,

commit 50f6701aa5ce8be96e216a942880a8db967c7a6a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 20 20:47:15 2014 +0000

    sna: Include serial numbers in ValidateGC DBG
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

and would appreciate it if you could gather another debug log for me. Thanks.
Comment 33 Chris Wilson 2014-01-22 12:13:50 UTC
Ok, I believe xf86-video-intel.git has all the fixes for the assertions you have all hit, so we should be ready for another shot at getting the vital debug for the pageflipping failure. Any volunteers?
Comment 34 Joe Peterson 2014-01-22 13:51:48 UTC
Chris, traveling today, so no time this AM, but I can try it either this evening or tomorrow...
Comment 35 Chris Wilson 2014-01-28 22:41:14 UTC
Has anyone had a chance to run with full debug recently?
Comment 36 Joe Peterson 2014-01-29 01:25:39 UTC
Chris,

I put some log files here:

    http://data.boulder.swri.edu/~joe/

Xorg.0.log and Xorg.0.log3 both happened with TearFree on using the Spotify search test.  I fear not much of use is in these, but please look (I looked at the tail of each).

Xorg.0.log2 happened with TearFree off just going into Chromium - unusual.

xorg_output.log3 accompanies Xorg.0.log3 as the stdout/stderr from startx during the session.  Looks like another assertion.
Comment 37 Chris Wilson 2014-01-29 05:58:25 UTC
(In reply to comment #36)
> Chris,
> 
> I put some log files here:
> 
>     http://data.boulder.swri.edu/~joe/
> 
> Xorg.0.log and Xorg.0.log3 both happened with TearFree on using the Spotify
> search test.  I fear not much of use is in these, but please look (I looked
> at the tail of each).

You underestimate the value of your debugging - it was very, very useful. Thank you.

commit 4b73a0ea22b43807c0118f4d7e9dcac3f0626463
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 29 05:32:25 2014 +0000

    sna: Skip undamaged TearFree redisplays
    
    If we have not had cause to flush the wait_for_shadow buffer during the
    course of the rendering, then we never wrote to the backbuffer and its
    contents are still identical to the current frontbuffer. So if the
    wait_for_shadow is still flagged as required on the scanout, we know we
    can safely discard the redisplay request.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=70905
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Time to try again!

> Xorg.0.log2 happened with TearFree off just going into Chromium - unusual.

Another assertion - not sure which, at a guess it decided the pointer was incoherent, which looks more like a bogus assertion...
Comment 38 Joe Peterson 2014-01-29 16:32:07 UTC
(In reply to comment #37)
> You underestimate the value of your debugging - it was very, very useful.
> Thank you.

Good to know!  :)

> Time to try again!

OK, pulled git and rebuilt this morning...  Good news, I think.  New files are in the same place:

    http://data.boulder.swri.edu/~joe/

> > Xorg.0.log2 happened with TearFree off just going into Chromium - unusual.
> 
> Another assertion - not sure which, at a guess it decided the pointer was
> incoherent, which looks more like a bogus assertion...

I tried a similar case (starting X and running Chromium with TearFree off).  No crash (but maybe that assertion is hard to hit).  Log files named with "_noTearFree_chromium".

I tried the Spotify with TearFree case, and no crash there, either!  Log files named with "_TearFree_spotify".

I'll keep running with the debug version for a while with TearFree on.  I wonder if this last round of changes did the trick - fingers crossed!
Comment 39 Chris Wilson 2014-01-29 16:41:54 UTC
(In reply to comment #38)
> (In reply to comment #37)
> > You underestimate the value of your debugging - it was very, very useful.
> > Thank you.
> 
> Good to know!  :)
> 
> > Time to try again!
> 
> OK, pulled git and rebuilt this morning...  Good news, I think.  New files
> are in the same place:

Heh, apparently '%D' is a typo that we need to be told about at 60Hz.

>     http://data.boulder.swri.edu/~joe/
> 
> > > Xorg.0.log2 happened with TearFree off just going into Chromium - unusual.
> > 
> > Another assertion - not sure which, at a guess it decided the pointer was
> > incoherent, which looks more like a bogus assertion...
> 
> I tried a similar case (starting X and running Chromium with TearFree off). 
> No crash (but maybe that assertion is hard to hit).  Log files named with
> "_noTearFree_chromium".

Had another bug report that revealed I screwed up in one of the patches yesterday which can explain all manner of strange failures.

> I tried the Spotify with TearFree case, and no crash there, either!  Log
> files named with "_TearFree_spotify".
> 
> I'll keep running with the debug version for a while with TearFree on.  I
> wonder if this last round of changes did the trick - fingers crossed!

*fingers crossed*
Certainly the issue in theory fixed by the patch this morning has the right hallmarks of being a random, hard to trigger bug that would end up not only causing the page-flip error but also the blit failure -- i.e. I think it explains this bug...
Comment 40 Chris Wilson 2014-01-30 11:01:36 UTC
I think Joe's happy - anyone else have a pass/fail report for the latest ddx + TearFree?
