Bug 90342

Summary:

GPU hang "stuck on render ring" after resume on broadwell

Product:

DRI

Reporter:

Chaskiel Grundman <cgrundman>

Component:

DRM/Intel

Assignee:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

medium

CC:

intel-gfx-bugs, tuxgirl

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
drm error file	none
xorg log	none

Description Chaskiel Grundman 2015-05-06 15:48:25 UTC

Created attachment 115596 [details]
drm error file

I installed version 2.99.917 of the Intel driver on an otherwise stock Debian 8.0 system (3.16 kernel).

Now, whenever I resume the system after suspend (or hibernate), X refreshes the screen and then seems unable to draw anything. A GPU hang is reported in dmesg. I can VT-switch in and out of X, and the console works, but X never recovers.

If I run the Debian experimental 4.0 kernel, this does not happen (but I do see VT-switching and non-video issues).

Comment 1 Chaskiel Grundman 2015-05-06 15:49:34 UTC

Created attachment 115597 [details]
xorg log

Comment 2 Ander Conselvan de Oliveira 2015-05-11 08:46:57 UTC

This is fixed upstream. Please file a separate bug report for the VT issue if it happens with recent kernels.

Comment 3 tuxgirl 2015-05-23 03:29:29 UTC

What kernel version is this fixed in? I just got this error on 4.0.4-1-ARCH. Unfortunately, the GPU crash dump is 0b, so I don't think there's much point to me creating a new ticket.

Comment 4 Jerome 2015-08-04 20:50:30 UTC

In reply to comment #3: see bug #89915 comment #7, which indicates a fix in 4.0.7 and points to a patch.

In my case, just like the initial reporter I'm running Debian Jessie (stable) with a backported Intel driver 2.99.917. Same hang issue with the stock kernel, but installing kernel 4.0.8 from Debian unstable fixed the issue.

Before updating the kernel I also tried the "i915.enable_execlists=0" workaround and it worked OK: either no hang, or a hang that recovered quickly and automatically.
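
For anyone trying the same workaround, it can help to confirm that the parameter actually reached the kernel. The sketch below is a hypothetical helper (the name cmdline_has is an assumption, not part of any existing tool) that checks whether a parameter appears on the kernel command line. It assumes the option was passed at boot through the bootloader rather than through a modprobe options file, in which case it would not show up here.

```shell
#!/bin/sh
# cmdline_has PARAM [FILE]
# Succeeds if PARAM appears as a whole, space-separated word in FILE.
# FILE defaults to /proc/cmdline, the command line of the running kernel.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put one word per line, then require an exact fixed-string match
    # on a full line so that partial matches do not count.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

Usage: `cmdline_has i915.enable_execlists=0 && echo "workaround active"`.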

Comment 5 Jerome 2015-08-04 21:34:22 UTC

I may have been too optimistic in comment #4 unfortunately...

With the stock kernel and without the "i915.enable_execlists=0" workaround, the bug was systematic.

With kernel 4.0.8 the bug seems much less frequent (2 times over a dozen suspend/resume). Unfortunately when it hangs it locks solid and requires a forced power down. No log on reboot.

Ander, could you clarify what fix you had in mind when you closed this bug? If this is the 4.0.7 fix mentioned in bug #89915 it's not a complete fix for me. I couldn't find another pointer to a fix (but may have missed it).

Thanks

Comment 6 Ander Conselvan de Oliveira 2015-08-06 12:05:16 UTC

(In reply to tuxgirl from comment #3)
> What kernel version is this fixed in? I just got this error on 4.0.4-1-ARCH.
> Unfortunately, the GPU crash dump is 0b, so I don't think there's much point
> to me creating a new ticket.

The GPU crash dump file is never actually 0b; ls lies. Just do

  $ cat /sys/class/drm/card0/error | bzip2 > error.bz2

once you get the GPU hang and attach that to a *new* ticket.


(In reply to Jerome from comment #5)
> I may have been too optimistic in comment #4 unfortunately...
> With kernel 4.0.8 the bug seems much less frequent (2 times over a dozen
> suspend/resume). Unfortunately when it hangs it locks solid and requires a
> forced power down. No log on reboot.

Your symptoms are different, so let's not jump to conclusions and assume it's the same bug. Please open a *new* ticket, following the instructions in

  https://01.org/linuxgraphics/documentation/how-report-bugs

Comment 7 Jerome 2015-08-06 21:44:07 UTC

Hi Ander,

Thanks for the feedback. My starting issue is exactly the same as Chaskiel's; it's just that I then tried different kernels than he did (staying with Debian, testing 4.0.8 and then unstable 4.1.3) and ended up with the non-systematic hard system lock-up on resume from suspend. I filed a Debian bug (794393) and was sent upstream here.

Since then I have read the how-to linked above and compiled the Intel DRM nightly kernel yesterday [1], and it looks fine: I went through 20 suspend/resume cycles without issue (I typically got the lock-up within 5 resumes). So yes, it looks like it's fixed upstream.

Thanks

[1] Tested with commit fb4572c00fadc1ac94816061e76c65b65607f66a /
    drm-intel-nightly: 2015y-08m-05d-15h-33m-02s UTC integration manifest
    Based on 4.2.0 RC5.

Comment 8 Jerome 2015-08-08 10:42:23 UTC

I spoke too quickly once again. After leaving the PC suspended for over a day, I got the usual lock-up on resume when switching to graphics mode, with a dark blank screen. This was with the nightly kernel indicated in my previous comment.
As usual, it required a forced power down and reboot and left no trace. That looks like a tough one to debug...

As requested by Ander I will open a new kernel bug and put the reference here.

Comment 9 Jerome 2015-08-08 12:03:57 UTC

Separate bug filed as bug #91585.

Nightly has the issue; the lock-up had nothing to do with the delay but with the DRM logs. When doing the 20 suspend/resume tests I had the DRM logs enabled (0x1e) as requested by the how-to. Afterward I removed the log option and performed a clean reboot, then suspended, and got a lock-up on resume.
I ran another test without extra DRM logs and got a lock-up on the second suspend/resume.
So the lock-up issue is definitely still there in nightly. It's just that heavy logging hides it.
Will follow up in bug #91585.
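
For reference, the 0x1e value is a bitmask. Assuming it was passed as the drm.debug kernel parameter (the comment above does not name the parameter), and assuming the category bits defined in the kernel's DRM headers around that time (CORE 0x01, DRIVER 0x02, KMS 0x04, PRIME 0x08, ATOMIC 0x10), this hypothetical helper shows which categories a given mask enables.

```shell
#!/bin/sh
# decode_drm_debug MASK
# Prints the names of the DRM debug categories set in MASK, space separated.
# The bit-to-name table is an assumption based on 4.x era kernel headers.
decode_drm_debug() {
    mask=$(( $1 ))
    names=""
    for entry in 1:CORE 2:DRIVER 4:KMS 8:PRIME 16:ATOMIC; do
        bit=${entry%%:*}     # numeric bit value before the colon
        name=${entry#*:}     # category name after the colon
        if [ $(( mask & bit )) -ne 0 ]; then
            names="${names:+$names }$name"
        fi
    done
    echo "$names"
}
```

Under those assumptions, `decode_drm_debug 0x1e` prints `DRIVER KMS PRIME ATOMIC`. That is a lot of output per suspend/resume cycle, which would be consistent with logging changing the timing enough to hide the lock-up.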
