Bug 99001 - [HSW] GPU HANG: ecode 7:0:0x85dffffc, in glxspheres64 [4492], reason: Hang on render ring, action: reset
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest blocker
Assignee: Chris Wilson
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2016-12-05 21:58 UTC by webstrand
Modified: 2019-09-25 18:59 UTC

See Also:
i915 platform: HSW
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error (12.75 KB, text/plain)
2016-12-05 21:58 UTC, webstrand
dmesg log from hang (161.65 KB, text/plain)
2016-12-05 21:59 UTC, webstrand
/sys/class/drm/card0/error for 4.7 (3.01 MB, text/plain)
2016-12-07 16:32 UTC, webstrand
/sys/class/drm/card0/error for weston (14.21 KB, text/plain)
2016-12-09 17:05 UTC, webstrand

Description webstrand 2016-12-05 21:58:11 UTC
Created attachment 128350
/sys/class/drm/card0/error

Upon opening any program which uses OpenGL, the GPU hangs.

System environment:
-- chipset: Intel(R) Core(TM) i7-4700MQ CPU 
-- system architecture: x86_64
-- xf86-video-intel: 2.99.917.740.g9ac7a33-1
-- xserver: 1.18.4-1
-- mesa: 13.0.2-2
-- libdrm: 2.4.74.r14.g0825792-1
-- kernel: 4.9.0-rc8inteldri+ commit 721d484 of intel-drm branch
-- Linux distribution: Archlinux
-- Machine: MSI GE60 2OE

Reproducing steps:

Run glxspheres64, and the GPU will hang.

Additional info:

I haven't been able to bisect the issue: every kernel from v3.8 to HEAD exhibits the hang. From v3.8 to v3.12 the GPU hangs only intermittently, and from v3.13 onward it hangs consistently.

The offending line in glxspheres.c which causes the GPU to hang appears to be:

    GLXFBConfig *c=glXChooseFBConfig(dpy, screen, rgbAttribs, &n);
Comment 1 webstrand 2016-12-05 21:59:02 UTC
Created attachment 128351
dmesg log from hang
Comment 2 yann 2016-12-06 08:45:02 UTC
Assigning to the Mesa product (please let me know if I am mistaken about this GPU hang).

Kernel: 4.9.0-rc8inteldri+ commit 721d484 of intel-drm branch
Platform: Haswell (pci id: 0x0416, pci revision: 0x06, pci subsystem: 1462:10e0)
Mesa: 13.0.2-2

According to this error dump, the hang occurs in a render ring batch with the active head at 0x018542c4 and IPEHR 0x7a000003 (PIPE_CONTROL).

The dump also reports ERROR: 0x00000101 [TLB page fault error (GTT entry not valid)], and the render ring reports "Unloaded PD Fault (PPGTT)".

Batch extract (around 0x018542c4):

0x01854294:      0x7b000005: 3DPRIMITIVE:
0x01854298:      0x00000006:    tri fan sequential
0x0185429c:      0x00000004:    vertex count
0x018542a0:      0x00000000:    start vertex
0x018542a4:      0x00000001:    instance count
0x018542a8:      0x00000000:    start instance
0x018542ac:      0x00000000:    index bias
0x018542b0:      0x7a000003: PIPE_CONTROL
0x018542b4:      0x00101001:    no write, cs stall, render target cache flush, depth cache flush,
0x018542b8:      0x00000000:    destination address
0x018542bc:      0x00000000:    immediate dword low
0x018542c0:      0x00000000:    immediate dword high
0x018542c4:      0x7a000003: PIPE_CONTROL
0x018542c8:      0x00000c10:    no write, instruction cache invalidate, texture cache invalidate, vf fetch invalidate,
0x018542cc:      0x00000000:    destination address
0x018542d0:      0x00000000:    immediate dword low
0x018542d4:      0x00000000:    immediate dword high
0x018542d8:      0x780e0000: 3DSTATE_CC_STATE_POINTERS
0x018542dc:      0x00007f01:    pointer to COLOR_CALC_STATE at 0x00007f00 (changed)
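
For readers following the dump, the flag annotations on the two PIPE_CONTROL dwords can be reproduced with a small decoder. This is only a sketch in Python: the bit positions are inferred from the annotations in this extract and should be checked against the hardware documentation, and the table and function names are made up for illustration.

```python
# Hypothetical helper: bit -> name table limited to the flags that appear
# in the decoded extract above. Bit positions are an assumption inferred
# from those annotations, not copied from the hardware documentation.
PIPE_CONTROL_FLAGS = {
    20: "cs stall",
    12: "render target cache flush",
    11: "instruction cache invalidate",
    10: "texture cache invalidate",
    4: "vf fetch invalidate",
    0: "depth cache flush",
}

# Assumed location of the post-sync operation field (0 means "no write").
POST_SYNC_SHIFT = 14
POST_SYNC_MASK = 0x3


def decode_pipe_control_flags(dword):
    """Return the flag names set in the second dword of a PIPE_CONTROL."""
    names = []
    post_sync = (dword >> POST_SYNC_SHIFT) & POST_SYNC_MASK
    names.append("no write" if post_sync == 0 else "post-sync write")
    # Highest bit first, which is the order used in the dump annotations.
    for bit in sorted(PIPE_CONTROL_FLAGS, reverse=True):
        if dword & (1 << bit):
            names.append(PIPE_CONTROL_FLAGS[bit])
    return names


if __name__ == "__main__":
    for flags in (0x00101001, 0x00000C10):
        print(f"0x{flags:08x}: {', '.join(decode_pipe_control_flags(flags))}")
```

The second value, 0x00000c10, belongs to the PIPE_CONTROL at the active head 0x018542c4; per the annotations it requests only cache invalidations, with no stall or flush bits set.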
Comment 3 Mark Janes 2016-12-06 18:06:07 UTC
This demo works properly with latest mesa, on haswell, using a stable kernel (4.7).

Yann, can you reproduce on the specified intel-drm kernel, and bisect the kernel regression?
Comment 4 yann 2016-12-07 13:13:41 UTC
(In reply to Mark Janes from comment #3)
> This demo works properly with latest mesa, on haswell, using a stable kernel
> (4.7).
> 
> Yann, can you reproduce on the specified intel-drm kernel, and bisect the
> kernel regression?

Thanks Mark, I will set up the environment on my side.

In the meantime, webstrand@gmail.com, can you confirm whether it works on your side with an earlier kernel, i.e. whether this is a regression?
Comment 5 webstrand 2016-12-07 16:32:46 UTC
Unfortunately, linux 4.7.0-1-ARCH also exhibits the GPU HANG.
Comment 6 webstrand 2016-12-07 16:32:58 UTC
Created attachment 128370
/sys/class/drm/card0/error for 4.7
Comment 7 Kenneth Graunke 2016-12-08 10:40:18 UTC
(In reply to webstrand from comment #0)
> Upon opening any program which uses OpenGL, the GPU hangs.

I'm not really sure what to say.  A lot of people use Archlinux with Haswell GT2 and OpenGL programs work just fine.  I suspect this is something specific to your setup, but I'm not sure what it would be.

It might be worth trying X with the modesetting driver instead of intel/SNA.  I don't know that it's the problem, but that would eliminate one of the variables.  (Easiest way to accomplish this is to pacman -R xf86-video-intel).
Comment 8 webstrand 2016-12-09 02:20:46 UTC
I was unable to get xorg to start without xf86-video-intel. I can invest some more time in figuring out what's wrong, if necessary.

However, I've installed a fresh copy of Ubuntu 14.04.4 on which I can also reproduce the GPU hang. I was able to successfully bisect the mainline kernel between v3.12 and v3.13.

    first bad commit: [b29c19b645287f7062e17d70fa4e9781a01a5d88] drm/i915: Boost RPS frequency for CPU stalls
Comment 9 Matt Turner 2016-12-09 02:53:07 UTC
Wonderful. Thank you for bisecting!
Comment 10 Chris Wilson 2016-12-09 07:49:04 UTC
(In reply to webstrand from comment #8)
> I was unable to get xorg to start without xf86-video-intel. I can invest
> some more time in figuring out what's wrong, if necessary.
> 
> However, I've installed a fresh copy of Ubuntu 14.04.4 on which I can also
> reproduce the GPU hang. I was able to successfully bisect the mainline
> kernel between v3.12 and v3.13.
> 
>     first bad commit: [b29c19b645287f7062e17d70fa4e9781a01a5d88] drm/i915:
> Boost RPS frequency for CPU stalls

False bisect result unfortunately.

(In reply to Kenneth Graunke from comment #7)
> (In reply to webstrand from comment #0)
> > Upon opening any program which uses OpenGL, the GPU hangs.
> 
> I'm not really sure what to say.  A lot of people use Archlinux with Haswell
> GT2 and OpenGL programs work just fine.  I suspect this is something
> specific to your setup, but I'm not sure what it would be.
> 
> It might be worth trying X with the modesetting driver instead of intel/SNA.
> I don't know that it's the problem, but that would eliminate one of the
> variables.  (Easiest way to accomplish this is to pacman -R
> xf86-video-intel).

Completely bogus and unhelpful.
Comment 11 webstrand 2016-12-09 17:04:27 UTC
(In reply to Chris Wilson from comment #10)
> False bisect result unfortunately.

I'm guessing you mean that the commit I referenced is another bug which has already been fixed? Would it be worth trying to bisect again? The last time I was able to use OpenGL on this laptop was with Linux v3.12.

(In reply to Kenneth Graunke from comment #7)
> It might be worth trying X with the modesetting driver instead of intel/SNA.

In my xorg config, I've set:
    Option      "NoAccel" "True"
And I've set LIBGL_ALWAYS_SOFTWARE=1 globally, which I unset temporarily for testing glxspheres64.

In an effort to reproduce the issue without using xorg, I installed weston. Launching weston triggers the GPU hang in about 1 of every 3 launches.
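
An intermittent hang makes bisection fragile: a bad kernel that happens to survive a few launches gets marked good, and the bisect then converges on an unrelated commit. As a rough illustration, assuming each launch hangs independently with a fixed probability (the 1-in-3 figure above is used here; whether it applied during the earlier bisect is not known), the sketch below estimates how many clean launches are needed before calling a kernel good. The function name is hypothetical.

```python
import math


def clean_runs_needed(hang_probability, max_false_good):
    """Consecutive hang-free launches needed before marking a kernel good.

    Assumes every launch on a bad kernel hangs independently with
    probability hang_probability. A bad kernel then survives n launches
    with probability (1 - hang_probability) ** n, so we pick the smallest
    n that pushes this below max_false_good.
    """
    if not 0.0 < hang_probability < 1.0:
        raise ValueError("hang_probability must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    survive = 1.0 - hang_probability
    return math.ceil(math.log(max_false_good) / math.log(survive))


if __name__ == "__main__":
    # 1-in-3 hang rate, at most a 1% chance of mislabelling a bad kernel.
    print(clean_runs_needed(1 / 3, 0.01))
```

Under these assumptions the estimate comes to 12 clean launches per bisect step, so a kernel tested only once or twice could easily be mislabelled as good.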
Comment 12 webstrand 2016-12-09 17:05:00 UTC
Created attachment 128394
/sys/class/drm/card0/error for weston
Comment 13 Jari Tahvanainen 2016-12-19 09:36:12 UTC
Setting to Highest+Blocker, as this is a regression without a workaround.
Comment 14 Jari Tahvanainen 2017-01-27 14:13:36 UTC
Yann, take five minutes and check whether the new crash log gives any new indication of the reason for the hang.
I also removed the bisected keyword, since that presumably does not apply.
Comment 15 yann 2017-01-30 15:13:29 UTC
(In reply to Jari Tahvanainen from comment #14)
> Yann, take five minutes and check whether the new crash log gives any new
> indication of the reason for the hang.
> I also removed the bisected keyword, since that presumably does not apply.

According to the Mesa engineers, Mesa only emits 3DSTATE_VERTEX_ELEMENTS on demand, right before 3DPRIMITIVE.

Chris changed SNA to emit a dummy primitive between VertexElements in xf86-video-intel commit 4acd4a7d3d2f41227022fa7581cfb85a0b124eae (see https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4acd4a7d3d2f41227022fa7581cfb85a0b124eae), but that change is for gen9. Here we are dealing with gen7; do we need such a mechanism in gen7_emit_vertex_elements as well?

An alternate solution is to use Glamor/modesetting.

Details:
- Kernel: 4.9.0-rc8inteldri+
- Platform: Haswell (PCI ID: 0x0416, PCI Revision: 0x06, PCI Subsystem: 1462:10e0)
- Mesa: 13.0.2-2
- xf86-video-intel: 2.99.917.740.g9ac7a33-1



0x008a0540:      0x78090001: 3DSTATE_VERTEX_ELEMENTS
0x008a0544:      0x02850000:    buffer 0: valid, type 0x0085, src offset 0x0000 bytes
0x008a0548:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0x008a054c:      0x7b000005: 3DPRIMITIVE:
0x008a0550:      0x00000006:    tri fan sequential
0x008a0554:      0x00000004:    vertex count
0x008a0558:      0x00000000:    start vertex
0x008a055c:      0x00000001:    instance count
0x008a0560:      0x00000000:    start instance
0x008a0564:      0x00000000:    index bias
0x008a0568:      0x7a000003: PIPE_CONTROL
0x008a056c:      0x00101001:    no write, cs stall, render target cache flush, depth cache flush,
0x008a0570:      0x00000000:    destination address
0x008a0574:      0x00000000:    immediate dword low
0x008a0578:      0x00000000:    immediate dword high
0x008a057c:      0x7a000003: PIPE_CONTROL
0x008a0580:      0x00000c10:    no write, instruction cache invalidate, texture cache invalidate, vf fetch invalidate,
0x008a0584:      0x00000000:    destination address
0x008a0588:      0x00000000:    immediate dword low
0x008a058c:      0x00000000:    immediate dword high
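
To check which command immediately precedes each 3DPRIMITIVE in a dump like this one, the batch can be walked by header length instead of read by eye. A minimal sketch, assuming the command identifier is the upper 16 bits of the header and the low 8 bits hold the total length in dwords minus two (both assumptions are consistent with the offsets in the extract above). The name table covers only the three commands shown here, and the function name is hypothetical.

```python
# Assumed mapping from the upper 16 bits of a header to a command name,
# limited to the commands that appear in the extract above.
COMMAND_NAMES = {
    0x7809: "3DSTATE_VERTEX_ELEMENTS",
    0x7B00: "3DPRIMITIVE",
    0x7A00: "PIPE_CONTROL",
}


def walk_batch(start_offset, dwords):
    """Return (offset, name) for each command header found in dwords."""
    commands = []
    index = 0
    while index < len(dwords):
        header = dwords[index]
        name = COMMAND_NAMES.get(header >> 16)
        if name is None:
            # Unknown header: its length field cannot be trusted, so stop.
            break
        commands.append((start_offset + 4 * index, name))
        # Assumed encoding: low 8 bits store (total dwords - 2).
        index += (header & 0xFF) + 2
    return commands


# Dwords copied from the extract above, starting at 0x008a0540.
BATCH = [
    0x78090001, 0x02850000, 0x11230000,
    0x7B000005, 0x00000006, 0x00000004, 0x00000000,
    0x00000001, 0x00000000, 0x00000000,
    0x7A000003, 0x00101001, 0x00000000, 0x00000000, 0x00000000,
    0x7A000003, 0x00000C10, 0x00000000, 0x00000000, 0x00000000,
]

if __name__ == "__main__":
    for offset, name in walk_batch(0x008A0540, BATCH):
        print(f"0x{offset:08x}: {name}")
```

On this extract the walk lands on 0x008a0540, 0x008a054c, 0x008a0568 and 0x008a057c, the same four headers shown in the dump, with 3DSTATE_VERTEX_ELEMENTS directly before 3DPRIMITIVE.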
Comment 16 GitLab Migration User 2019-09-25 18:59:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1551.

