Bug 98706

Summary:	[IVB] GPU HANG: ecode 7:0:0x85fffffa, in X [16159], reason: Hang on render ring, action: reset (after resume from hibernation on ThinkPad x230)
Product:	Mesa	Reporter:	Eugene A. Shatokhin <eugene.shatokhin>
Component:	Drivers/DRI/i965	Assignee:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Status:	CLOSED WORKSFORME	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs
Version:	12.0
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:	IVB	i915 features:	GPU hang
Attachments:	Gzipped contents of /sys/class/drm/card0/error

Description Eugene A. Shatokhin 2016-11-13 13:57:32 UTC

Created attachment 127950 [details]
Gzipped contents of /sys/class/drm/card0/error

When the system resumes from hibernation on my ThinkPad x230, a GPU hang is reported in dmesg and the X11 server restarts.

So far the problem happens only if I use the "modesetting" X11 driver and does not occur if I use Intel's X11 driver.

OS: ROSA Linux x86_64
GPU: VGA compatible controller [0300]: Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09) (prog-if 00 [VGA controller])

Kernel: I checked versions 4.8.6 and 4.8.7 (4.8.6-nrj-desktop-1rosa-x86_64, 4.8.7-nrj-desktop-1rosa-x86_64).

X11 server: 1.17.4

I have been using kernel 4.8.6 and the "modesetting" X11 driver for about a week now, and hibernated and resumed the laptop a dozen times without any problems. Since yesterday, however, a GPU hang happens each time I resume the laptop from hibernation.

By the way, I used the following file in /etc/xorg/conf.d to switch to the "modesetting" X11 driver:
51-modesetting.conf:
--------------------
Section "Device"
	Identifier  "Device0"
	Driver      "modesetting"
	Option      "AccelMethod"    "glamor"
EndSection
--------------------

I switched to Intel's X11 driver (git rev. 8f33f80 as of 2016-09-23) and the problem no longer shows up. When I switched back to the "modesetting" driver, the problem appeared again.

I tried kernel 4.8.7 - the results are the same as for 4.8.6.

From dmesg after the problem happened:
-------------------
[  514.736258] [drm] GPU HANG: ecode 7:0:0x85fffffa, in X [16159], reason: Hang on render ring, action: reset
[  514.736259] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  514.736259] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  514.736259] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  514.736260] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  514.736260] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  514.736301] drm/i915: Resetting chip after gpu hang
[  525.744243] drm/i915: Resetting chip after gpu hang
-------------------
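
For anyone triaging several such reports, the hang summary line above can be split into its fields with a short script. This is only a sketch based on the single line quoted in this report: the function name parse_gpu_hang is made up for illustration, and reading the first two ecode numbers as GPU generation and ring id is an assumption, not something stated in this bug.

```python
import re

# Pattern modelled on the single "GPU HANG" line quoted in this report.
# Assumption: the first two ecode numbers are the GPU generation and the ring id.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a 'GPU HANG' dmesg line, or None if it does not match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


line = ("[  514.736258] [drm] GPU HANG: ecode 7:0:0x85fffffa, in X [16159], "
        "reason: Hang on render ring, action: reset")
info = parse_gpu_hang(line)
print(info["code"], info["process"], info["pid"], info["reason"], info["action"])
```

The ecode value, process name and reason together are usually enough to tell whether two reports describe the same hang signature.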

Full dmesg: https://linux-hardware.org/index.php?probe=0f9af40d17&log=dmesg

lspci: https://linux-hardware.org/index.php?probe=0f9af40d17&log=lspci_all

Other info about the hardware and the system:
https://linux-hardware.org/index.php?probe=0f9af40d17

Before the problem started to occur, if that matters, I updated mesa (12.0.3 => 12.0.4), cpupower and installed kernel 4.8.7 alongside 4.8.6.

Comment 1 Eugene A. Shatokhin 2016-11-13 14:00:25 UTC

If needed, I can test the patches to the kernel or X11 server.

However, I have to stick to X11 server 1.17.4 for another month or two, so I cannot update it as a whole. Still, I can try patches to it.

Comment 2 yann 2016-11-14 15:42:44 UTC

There were improvements pushed to the kernel and Mesa (13) that will benefit your system, so please re-test with the latest kernel & Mesa to see if this issue is still occurring: mark as REOPENED if you can reproduce and RESOLVED/* if you cannot reproduce.

In parallel, I am assigning this to the Mesa product (please let me know if I am mistaken about this GPU hang).

Kernel: 4.4.0-rc6-mainline
Platform: Ivybridge (pci id: 0x0166, pci revision: 0x09, pci subsystem: 17aa:21fa)
Mesa: 12.0.4

From this error dump, the hang is happening in the render ring batch, with the active head at 0x00dcf294 and 0x7a000003 (PIPE_CONTROL) as IPEHR.

We can note also: ERROR: 0x00000101
    TLB page fault error (GTT entry not valid)
    Cacheline containing a PD was marked as invalid

and in render batch: Unloaded PD Fault (PPGTT)

Batch extract (around 0x00dcf294):

0x00dcf264:      0x7b000005: 3DPRIMITIVE:
0x00dcf268:      0x00000104:    tri list random
0x00dcf26c:      0x00000006:    vertex count
0x00dcf270:      0x00000000:    start vertex
0x00dcf274:      0x00000001:    instance count
0x00dcf278:      0x00000000:    start instance
0x00dcf27c:      0x00000000:    index bias
0x00dcf280:      0x7a000003: PIPE_CONTROL
0x00dcf284:      0x00101001:    no write, cs stall, render target cache flush, depth cache flush,
0x00dcf288:      0x00000000:    destination address
0x00dcf28c:      0x00000000:    immediate dword low
0x00dcf290:      0x00000000:    immediate dword high
0x00dcf294:      0x7a000003: PIPE_CONTROL
0x00dcf298:      0x00000408:    no write, texture cache invalidate, constant cache invalidate,
0x00dcf29c:      0x00000000:    destination address
0x00dcf2a0:      0x00000000:    immediate dword low
0x00dcf2a4:      0x00000000:    immediate dword high
0x00dcf2a8:      0x78210000: 3DSTATE_VIEWPORT_STATE_POINTERS_SF_CLIP
0x00dcf2ac:      0x00007dc0:    pointer to SF_CLIP viewport
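
To make the extract above easier to follow, here is a small sketch that reproduces two parts of the decoding: the command length taken from the header dword, and the flag names of the two PIPE_CONTROL flag dwords. The bit positions are read off the two decoded dwords in this extract only, so the table is deliberately incomplete. The helper names are made up for illustration, and both the length rule (low 8 bits = total dwords minus 2) and the meaning of bits 15:14 are assumptions that happen to match this extract.

```python
# Flag bits inferred from the two decoded PIPE_CONTROL dwords in the extract
# (0x00101001 and 0x00000408). This is not a complete table of the command's flags.
PIPE_CONTROL_FLAGS = {
    20: "cs stall",
    12: "render target cache flush",
    10: "texture cache invalidate",
    3: "constant cache invalidate",
    0: "depth cache flush",
}


def command_length_dwords(header):
    """Total command length in dwords.

    Assumption: the low 8 bits of the header hold (total dwords - 2),
    which matches every command visible in the extract.
    """
    return (header & 0xFF) + 2


def decode_pipe_control_flags(dw1):
    """Return flag names for a PIPE_CONTROL flags dword, highest bit first."""
    # Assumption: bits 15:14 select the post-sync write; 0 means "no write".
    names = ["no write"] if (dw1 >> 14) & 0x3 == 0 else ["post-sync write"]
    for bit in sorted(PIPE_CONTROL_FLAGS, reverse=True):
        if dw1 & (1 << bit):
            names.append(PIPE_CONTROL_FLAGS[bit])
    return names


# The first PIPE_CONTROL starts at 0x00dcf280; the next command follows it directly.
next_command = 0x00DCF280 + 4 * command_length_dwords(0x7A000003)
print(hex(next_command))
print(decode_pipe_control_flags(0x00101001))
print(decode_pipe_control_flags(0x00000408))
```

With these assumptions the command following the first PIPE_CONTROL lands exactly on the active head address reported in the error dump, i.e. the second PIPE_CONTROL (the texture/constant cache invalidate).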

Comment 3 Eugene A. Shatokhin 2016-11-15 07:55:39 UTC

(In reply to yann from comment #2)

Thanks for a quick reply!

> There were improvements pushed to the kernel and Mesa (13) that will benefit
> your system, so please re-test with the latest kernel & Mesa to see if this
> issue is still occurring: mark as REOPENED if you can reproduce and
> RESOLVED/* if you cannot reproduce.

Yes, I will re-test it with Mesa 13.x and the mainline kernel 4.9 (or, do you suggest another git tree?), hopefully, later this week. 

> 
> Kernel: 4.4.0-rc6-mainline
It is 4.8.7 on that system, actually.

> Platform: Ivybridge (pci id: 0x0166, pci revision: 0x09, pci subsystem:
> 17aa:21fa)
> Mesa: 12.0.4

Comment 4 yann 2016-11-15 08:03:45 UTC

(In reply to Eugene A. Shatokhin from comment #3)
> (In reply to yann from comment #2)
> 
> Thanks for a quick reply!
> 
> > There were improvements pushed to the kernel and Mesa (13) that will benefit
> > your system, so please re-test with the latest kernel & Mesa to see if this
> > issue is still occurring: mark as REOPENED if you can reproduce and
> > RESOLVED/* if you cannot reproduce.
> 
> Yes, I will re-test it with Mesa 13.x and the mainline kernel 4.9 (or, do
> you suggest another git tree?), hopefully, later this week. 
Thanks Eugene, current mainline is fine :)

> 
> > 
> > Kernel: 4.4.0-rc6-mainline
> It is 4.8.7 on that system, actually.
> 
You are right, bad copy-and-paste :^(. To be accurate, this is: 4.8.7-nrj-desktop-1rosa-x86_64

Comment 5 Eugene A. Shatokhin 2016-11-15 14:56:55 UTC

I have updated Mesa to 13.0.1 and libdrm to 2.4.73 while keeping the kernel the same for now. No problem after resume so far. 

I will monitor the system for a couple of days to see if the issue shows up again.

Comment 6 Eugene A. Shatokhin 2016-11-17 07:01:47 UTC

OK, after several hibernate-resume cycles over 2 days, the problem hasn't shown up. Let us assume the Mesa 13.0.1 and/or libdrm update fixed it.
