73644 – [gen4] GPU reset fails

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Ville Syrjala
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (1):	73662
Depends on:
Blocks:

Reported:	2014-01-15 04:21 UTC by drrossum
Modified:	2017-07-24 22:56 UTC
CC List:	6 users

See Also:
i915 platform:
i915 features:

Attachments
Kernel log showing the backtrace of i915 driver crash (19.22 KB, text/plain) 2014-01-15 04:21 UTC, drrossum
earlier dmesg output showing drm and i915 errors before backtrace dump (4.47 KB, text/plain) 2014-01-15 15:57 UTC, drrossum
Xorg.0.log for a session where display crashes (21.71 KB, text/plain) 2014-01-15 16:54 UTC, drrossum
/sys/class/drm/card0/error (2.10 MB, text/plain) 2014-01-15 17:05 UTC, drrossum
dmesg log (2) (106.44 KB, text/plain) 2014-01-21 06:52 UTC, Andrey Skvortsov
/sys/class/drm/card0/error (2) (761.78 KB, text/plain) 2014-01-21 06:53 UTC, Andrey Skvortsov
Xorg.0.log from session with a crash (36.13 KB, text/plain) 2014-01-28 11:59 UTC, Maciej Łoziński
dmesg after wedge test (660 bytes, text/plain) 2014-09-05 14:10 UTC, drrossum
/sys/class/drm/card0/error after wedge test (674.33 KB, text/plain) 2014-09-05 14:10 UTC, drrossum

Description drrossum 2014-01-15 04:21:40 UTC

Created attachment 92111 [details]
Kernel log showing the backtrace of i915 driver crash

After a random amount of time in X, the display turns black and cannot be brought back.  Only putting the machine to sleep and waking it again restores the display.

The problem seems to occur when switching workspaces.  Not sure if this is always the case, though.

Reverting to driver version xf86-video-intel 2.21.15 resolves the problem.

See also this bug report:
https://bugs.archlinux.org/task/38518

The kernel log including backtrace is attached.

Comment 1 Chris Wilson 2014-01-15 11:45:42 UTC

Note that the dmesg here is from a lid-event, which is not spectacularly random. Perhaps you have more than one bug here? Can you please attach your Xorg.0.log as well?

Comment 2 drrossum 2014-01-15 15:57:16 UTC

Created attachment 92157 [details]
earlier dmesg output showing drm and i915 errors before backtrace dump

I updated to version 2.99.907 again and am now waiting for the failure to happen.  I will attach Xorg.0.log when I get it.

I had tried a lid event to get the screen back to life (which did not help).

Here I attached the dmesg just before the lid event backtrace dump.  It contains related messages.

Comment 3 drrossum 2014-01-15 16:54:03 UTC

Created attachment 92165 [details]
Xorg.0.log for a session where display crashes

There is little useful output in the Xorg.0.log I fear.  Is there a way I can increase verbosity?

This crash happened completely at random, out of the blue, with no workspace switches or other activity involved.  I was reading a webpage.

Comment 4 drrossum 2014-01-15 16:58:19 UTC

I can also confirm that you (Chris) are right that the backtrace dump (attachment 1) does not appear in dmesg at the time of the display crash.  What does appear is this:

Jan 15 10:49:15 idefix kernel: [drm] stuck on render ring
Jan 15 10:49:15 idefix kernel: [drm] capturing error event; look for more information in /sys/class/drm/card0/error
Jan 15 10:49:15 idefix kernel: [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x3702000 ctx 0) at 0x370373c
Jan 15 10:49:15 idefix kernel: [drm:i915_reset] *ERROR* Failed to reset chip.
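
For anyone triaging similar logs, here is a minimal sketch of how the two events above (the ring hang and the failed reset) could be picked out of a kernel log automatically. The function name `summarize_i915_log` and the regular expressions are illustrative assumptions based only on the message text quoted in this report; other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the messages quoted in this bug report only.
# Assumption: other kernel versions may phrase these lines differently.
HANG_RE = re.compile(r"\[drm\] stuck on (\w+) ring")
RESET_FAIL_RE = re.compile(r"\[drm:i915_reset\] \*ERROR\* Failed to reset chip")


def summarize_i915_log(lines):
    """Return (hung_rings, reset_failed) for an iterable of log lines.

    hung_rings is the list of ring names reported as stuck, in order.
    reset_failed is True if any line reports a failed chip reset.
    """
    hung_rings = []
    reset_failed = False
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hung_rings.append(match.group(1))
        if RESET_FAIL_RE.search(line):
            reset_failed = True
    return hung_rings, reset_failed


sample = [
    "Jan 15 10:49:15 idefix kernel: [drm] stuck on render ring",
    "Jan 15 10:49:15 idefix kernel: [drm:i915_reset] *ERROR* Failed to reset chip.",
]
print(summarize_i915_log(sample))
```

A hang without a reset failure would point at the rendering bug alone; a hang followed by the reset failure matches the symptom reported here.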

Comment 5 Chris Wilson 2014-01-15 17:05:46 UTC

The hang is bug 73348, fixed by

commit 9d8473c5d9489db439aca73f470bda29a22ebab6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 7 13:43:35 2014 +0000

    sna/gen4: Check for available batch space before restoring state after CA pass
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73348
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

But the subsequent reset failure and display failure are unexpected.

Comment 6 drrossum 2014-01-15 17:05:52 UTC

Created attachment 92166 [details]
/sys/class/drm/card0/error

Comment 7 Chris Wilson 2014-01-15 17:11:26 UTC

*** Bug 73662 has been marked as a duplicate of this bug. ***

Comment 8 drrossum 2014-01-20 18:29:48 UTC

I've been using UXA for some time and not seen this problem so far.  Is that because the GPU is not reset in UXA or is the reset method different?

Comment 9 Chris Wilson 2014-01-20 19:31:34 UTC

(In reply to comment #8)
> I've been using UXA for some time and not seen this problem so far.  Is that
> because the GPU is not reset in UXA or is the reset method different?

The reset is due to a known issue in SNA, see bug 73348. That the reset fails is unusual and a separate issue.

Comment 10 Andrey Skvortsov 2014-01-21 06:50:16 UTC

I have this issue as well when testing the latest mainline kernel on my Squeeze machine. With the stock 2.6.32 kernel the system can run almost forever. I confirm that this bug happens pretty much at random. The system can run for a couple of days without an issue, or sometimes only for a couple of minutes. The cause is not clear to me either. Sometimes I am just reading something on the web and doing nothing special; sometimes it happens when I am starting an app.

I will attach below my dmesg and crash dump. I hope this will help.

[drm] stuck on render ring
[drm:i915_set_reset_status] *ERROR* render ring hung flushing bo (0x4c3f000 ctx 0) at 0x2ce03e14
[drm:i915_reset] *ERROR* Failed to reset chip

Comment 11 Andrey Skvortsov 2014-01-21 06:52:34 UTC

Created attachment 92503 [details]
dmesg log (2)

dmesg from system with the latest mainline kernel: Linux version 3.13.0-rc8-0105--00005-ga6da83f-dirty

Comment 12 Andrey Skvortsov 2014-01-21 06:53:46 UTC

Created attachment 92504 [details]
/sys/class/drm/card0/error (2)

Comment 13 Maciej Łoziński 2014-01-28 11:57:33 UTC

I don't know whether this is about a failed GPU reset; I just have the issue described in Arch Linux bug #38518. I have this problem on a system freshly installed on 22.01.2013.

Notably, only the video blanks out. The system keeps running normally: I can close all windows using Alt-F4 and even shut the system down cleanly by blindly clicking the power-down icon.

Linux xxxx 3.12.8-1-ARCH #1 SMP PREEMPT Thu Jan 16 09:16:34 CET 2014 x86_64 GNU/Linux
xorg-server 1.15.0-5
xf86-video-intel 2.99.907-2

00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 0c)
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) (rev 0c)
00:02.1 Display controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (secondary) (rev 0c)

Comment 14 Maciej Łoziński 2014-01-28 11:59:30 UTC

Created attachment 92919 [details]
Xorg.0.log from session with a crash

Comment 15 Alexandre 2014-01-30 02:49:46 UTC

I also have a problem where the display goes blank out of nowhere (I noted at least twice it coincided with clicking on some button on qt/kde programs).

VT switching restores it though.
The X.log gives this:

[ 48895.399] (EE) intel(0): sna_mode_redisplay: page flipping failed, disabling CRTC:3 (pipe=0)

For each and every time that happens.

SNA compiled from 2.99.907-54-g294180b

Comment 16 Chris Wilson 2014-01-30 08:38:00 UTC

(In reply to comment #15)
> I also have a problem where the display goes blank out of nowhere (I noted
> at least twice it coincided with clicking on some button on qt/kde programs).
> 
> VT switching restores it though.
> The X.log gives this:
> 
> [ 48895.399] (EE) intel(0): sna_mode_redisplay: page flipping failed,
> disabling CRTC:3 (pipe=0)
> 
> For each and every time that happens.
> 
> SNA compiled from 2.99.907-54-g294180b

See bug 70905, in particular 2.99.907-61-g4b73a0e.

Comment 18 drrossum 2014-02-04 15:54:58 UTC

I haven't run into the blank screen problem "[drm:i915_reset] *ERROR* Failed to reset chip" with versions 2.99.908 and 2.99.909.

I'm not sure if that is because "[drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x3702000 ctx 0) at 0x370373c" no longer happens or if the GPU reset now works.  Is there a way I can test whether the GPU reset works on [gen4]?

Comment 19 Chris Wilson 2014-02-04 21:45:44 UTC

If you don't see the error message that a hang was detected, we haven't attempted to reset the GPU.

The simplest test for GPU reset is "echo 1 > /sys/kernel/debug/dri/0/i915_wedged". However, when the GPU hangs for real it is more likely for the reset to fail.
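
A minimal sketch of the same test as a small script, for anyone who wants to run it repeatedly. It only performs the write that the `echo` command above performs; the function name `trigger_wedge` is hypothetical, and it assumes debugfs is mounted at /sys/kernel/debug, that the i915 device is DRM card 0, and that the script runs as root.

```python
from pathlib import Path

# Path taken from the echo command above. Assumptions: debugfs is mounted
# at /sys/kernel/debug and the i915 device is DRM card 0.
WEDGED_PATH = Path("/sys/kernel/debug/dri/0/i915_wedged")


def trigger_wedge(path=WEDGED_PATH):
    """Write "1" to the i915_wedged debugfs file (same as `echo 1 > path`).

    Raises FileNotFoundError if the file is missing, which usually means
    debugfs is not mounted or the card index is different.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; is debugfs mounted?")
    path.write_text("1\n")


# Usage (as root). Warning: if the reset fails, the display stays black
# until the machine is suspended and resumed, as described in this report.
# trigger_wedge()
```

The call is left commented out on purpose, since running it on an affected machine reproduces the black screen.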

Comment 20 drrossum 2014-02-06 04:08:30 UTC

Just did this test.  It produces exactly the crash that I experienced with version 907:

[ 1540.594567] [drm] Manually setting wedged to 1
[ 1540.594580] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[ 1541.099920] [drm:i915_reset] *ERROR* Failed to reset chip.

I can revive the GPU by putting the whole system to sleep and waking it up again.

I guess that means that this bug is not resolved yet, but as long as the GPU doesn't hang the driver doesn't try to reset and this bug is not triggered.
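
As a rough illustration of reading the wedge test result out of dmesg, the sketch below looks for the manual wedge message and then for a reset failure after it, and reports the delay between the two. The function name `wedge_outcome` and the result strings are made up for this example, and the matching relies only on the message text quoted above. Note that the absence of a failure message is not proof that the reset succeeded.

```python
import re

# Matches dmesg lines of the form "[ 1540.594567] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def wedge_outcome(lines):
    """Classify a dmesg excerpt taken after an i915_wedged test.

    Returns (outcome, delay_seconds). delay_seconds is the time between
    the manual wedge and the reset failure, or None if no failure is seen.
    """
    wedged_at = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if "Manually setting wedged" in message:
            wedged_at = timestamp
        elif wedged_at is not None and "Failed to reset chip" in message:
            return "reset failed", timestamp - wedged_at
    if wedged_at is None:
        return "no wedge seen", None
    return "no failure logged", None


sample = [
    "[ 1540.594567] [drm] Manually setting wedged to 1",
    "[ 1540.594580] [drm] capturing error event; look for more information in /sys/class/drm/card0/error",
    "[ 1541.099920] [drm:i915_reset] *ERROR* Failed to reset chip.",
]
outcome, delay = wedge_outcome(sample)
print(outcome, round(delay, 3))
```

For the log above, the failure is reported roughly half a second after the wedge was set.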

Comment 22 Chris Wilson 2014-05-19 20:20:54 UTC

Ville has just posted a patch set and is looking for victims^W volunteers.

Comment 23 Ville Syrjala 2014-05-20 08:11:08 UTC

(In reply to comment #22)
> Ville has just posted a patch set and is looking for victims^W volunteers.

Patches pushed here for easier consumption:
git://gitorious.org/vsyrjala/linux.git gpu_reset_fixes_2

Comment 24 Jani Nikula 2014-09-05 12:04:37 UTC

Ville, what's the status of the patches? Upstreamed, forgotten, what?

drrossum, testing the patches helps in getting them upstreamed...

Comment 25 Ville Syrjala 2014-09-05 13:07:37 UTC

(In reply to comment #24)
> Ville, what's the status of the patches? Upstreamed, forgotten, what?

I didn't really spend much time on them, so they might have some issues, but at least my 946gz seemed to work with them. If someone wants to play around with them or improve them go ahead. I don't have time atm.

Comment 26 drrossum 2014-09-05 14:07:46 UTC

(In reply to comment #24)
> drrossum, testing the patches helps in getting them upstreamed...

I have not experienced any more random crashes since versions 908 and 909, as noted in comment #18.  I'm now on 2.99.914.

I just tried the "i915_wedged" test that Chris suggested in comment #19.  It does NOT crash the driver anymore.  I have attached /sys/class/drm/card0/error and the tail of dmesg in case anyone is interested.

I am marking this bug as resolved.

Comment 27 drrossum 2014-09-05 14:10:09 UTC

Created attachment 105803 [details]
dmesg after wedge test

See comment #26

Comment 28 drrossum 2014-09-05 14:10:51 UTC

Created attachment 105804 [details]
/sys/class/drm/card0/error after wedge test

See comment #26
