Bug 106243

Summary: [kbl] GPU HANG: 9:0:0x85dffffb, in Cinnamon
Product: Mesa
Reporter: Gary <garyfrombugzilla>
Component: Drivers/DRI/i965
Assignee: Gary <garyfrombugzilla>
Status: RESOLVED WORKSFORME
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: critical
Priority: high
CC: intel-gfx-bugs, rafael.antognolli
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
i915 platform: KBL
i915 features: GPU hang
Attachments: error log
dmesg output with drm.debug=0xe
dmesg hanging output with drm.debug=0xe
drm-tip dmesg & error with drm.debug=0xe
i965: require post sync operation priot to ISP disable
i965: require post sync operation priot to ISP disable
drm-tip dmesg & error with patch 139248
Hang with v4 patch

Description Gary 2018-04-26 05:24:28 UTC
Created attachment 139119 [details]
error log

Occasional but then persistent GPU hangs on a Clevo N130WU. Once they start they continue even after reboots, unless the laptop is shut down and left off for a while (?). The internal display freezes, sometimes a black jagged triangle is drawn across the top-right of the desktop, the display recovers, then it freezes again. The hangs are intermittent and not reproducible on demand. They are triggered by opening a menu or mousing over the panel, or by using Firefox or MPV in MATE (both use GPU acceleration, MATE doesn't?). Seen on x86_64 Mint 18.3 Cinnamon (LiveUSB and installed) and on a Fedora 27 Cinnamon LiveUSB. Updated to Linux 4.16.3, kbl_dmc_ver1_04 and ppa:oibaf/graphics-drivers (Mesa 18.2.0-devel, libdrm 2.4.91, xserver-xorg-video-intel uninstalled, etc). GPU crash dumps are attached. I was unable to dump the VBIOS ("cat: '/sys/devices/pci0000:00/0000:00:02.0/rom': Input/output error"), and there is no "stolen memory" setting in the BIOS (?). I am not technically skilled but trying. Thank you

Relevant dmesg messages:
Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
GPU HANG: ecode 9:0:0x85dffffb, in cinnamon [1735], reason: Hang on rcs0, action: reset
i915 0000:00:02.0: Resetting rcs0 after gpu hang
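
For anyone collecting these lines from several boots, here is a minimal sketch of pulling the fields out of the "GPU HANG" message quoted above. The regular expression and the group names are my own. Reading the first two ecode numbers as GPU generation and engine index is an assumption, not something stated in this report.

```python
import re

# Pattern for the i915 hang line quoted above. Group names are illustrative;
# "gen" and "engine" are an assumed reading of the first two ecode numbers.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


line = ("GPU HANG: ecode 9:0:0x85dffffb, in cinnamon [1735], "
        "reason: Hang on rcs0, action: reset")
hang = parse_hang(line)
print(hang)
```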
Comment 1 Jani Saarinen 2018-04-26 06:57:48 UTC
Could you provide a dmesg log booting with drm.debug=0xe?
Please also test the latest drm-tip: https://cgit.freedesktop.org/drm-tip
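
As a rough illustration of what drm.debug=0xe selects, the value is a bitmask of debug categories. The category table below is how I understand the kernel's DRM documentation, so treat it as an assumption and check it against the kernel version in use.

```python
# Category bits of the drm.debug module parameter, as I understand them from
# the kernel's DRM documentation. This table is an assumption, not taken from
# this bug report.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """List the category names whose bit is set in mask, lowest bit first."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


enabled = decode_drm_debug(0xe)
print(enabled)
```

Under that table, 0xe turns on driver, modesetting and PRIME messages while leaving the very verbose core messages off.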
Comment 2 Mika Kuoppala 2018-04-27 10:37:45 UTC
Bad count in PIPE_CONTROL
0x00002128:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x0000212c:      0x00100200:    destination address
0x00002130:      0x00000000:    immediate dword low
0x00002134:      0x00000000:    immediate dword high
0x00002140:      0x05000000: MI_BATCH_BUFFER_END

Stuck in a final flush of the batch submitted by cinnamon, so 
the batch in question is a suspect.
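
A small sketch of how the header dword in the dump above can be read. The bit positions are an assumption based on the usual layout of Intel 3D command headers (they are not given in this report), and the function name is made up for the example.

```python
def decode_3d_header(dword):
    """Split a 3D-pipeline command header into its fields.

    The bit positions are an assumption based on the usual layout of
    Intel 3D command headers; they are not stated in this report.
    """
    return {
        "command_type": (dword >> 29) & 0x7,
        "subtype": (dword >> 27) & 0x3,
        "opcode": (dword >> 24) & 0x7,
        "sub_opcode": (dword >> 16) & 0xFF,
        "dword_length": dword & 0xFF,
    }


header_offset = 0x2128                       # offset of 0x7a000004 in the dump
fields = decode_3d_header(0x7a000004)
total_dwords = fields["dword_length"] + 2    # length field excludes two dwords
next_offset = header_offset + 4 * total_dwords
print(fields, hex(next_offset))
```

Under these assumptions the packet is six dwords long, which places the next command at 0x2140. That matches where MI_BATCH_BUFFER_END sits in the dump, even though the decoder printed only four of the six dwords.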
Comment 3 Lionel Landwerlin 2018-04-27 11:12:33 UTC
I think this pipe control was added by :

commit e7ecc5e1600a9463f3f2fff9a9cdaa35c2f68c04
Author: Rafael Antognolli <rafael.antognolli@intel.com>
Date:   Thu Jan 25 17:14:47 2018 -0800

    i965: Emit PIPE_CONTROL with ISP bit on older platforms.
    Emit it on all platforms since gen7.
Comment 4 Lionel Landwerlin 2018-04-27 11:19:12 UTC
The documentation seems to imply that we should have a post sync operation in this pipe control :

"Indirect State Pointers Disable

At the completion of the post-sync operation associated with this pipe
control packet, the indirect state pointers in the hardware are
considered invalid; the indirect pointers are not saved in the context.
If any new indirect state commands are executed in the command stream
while the pipe control is pending, the new indirect state commands are

The ROW_INSTDONE/SAMPLER_INSTDONE values seem to imply that the hardware isn't completely done working, so that might be why we hang?
Comment 5 Gary 2018-04-27 18:46:41 UTC
Created attachment 139182 [details]
dmesg output with drm.debug=0xe

I added drm.debug=0xe and am attaching the dmesg log. I can add a 0xe crashlog when it crashes. In the meantime, I will learn to compile drm-tip. Thank you all for the bug attention.
Comment 6 Gary 2018-04-27 23:02:49 UTC
Created attachment 139183 [details]
dmesg hanging output with drm.debug=0xe

Good news maybe, the laptop is in a hanging mood again. Attached is a crash dmesg with drm.debug=0xe, missed_breadcrumb and hangcheck info. I have not yet learned drm-tip. Now installing Linux 4.17-rc2 and builds of today's git checkouts of libdrm and mesa.
Comment 7 Gary 2018-05-01 02:54:32 UTC
Created attachment 139242 [details]
drm-tip dmesg & error with drm.debug=0xe

Attaching dmesg & error using drm-tip kernel (Ubuntu mainline build April 18). Installing later builds failed. This hang was less immediate, and the screen flickered more. I will continue to update.
Comment 8 Lionel Landwerlin 2018-05-01 13:01:40 UTC
Based on the error you're seeing in /sys, I've come up with the attached patch.
It seems to work on our CI, but since we haven't detected this before, we probably have a gap.
Is there any way you could give it a try?

Thanks a lot.
Comment 9 Lionel Landwerlin 2018-05-01 13:02:04 UTC
Created attachment 139247 [details] [review]
i965: require post sync operation priot to ISP disable
Comment 10 Lionel Landwerlin 2018-05-01 13:47:43 UTC
Created attachment 139248 [details] [review]
i965: require post sync operation priot to ISP disable

Apologies, I ran it through the CI, then made a small change and screwed up...
Comment 11 Lionel Landwerlin 2018-05-09 19:23:26 UTC
Pushed a fix in https://cgit.freedesktop.org/mesa/mesa/commit/?id=f536097f67521180dafd270b28ac9a852af9c141 :

commit f536097f67521180dafd270b28ac9a852af9c141
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Tue May 1 12:32:45 2018 +0100

    i965: require pixel scoreboard stall prior to ISP disable
    Invalidating the indirect state pointers might affect a previously
    scheduled & still running 3DPRIMITIVE (causing page fault). So stall
    on pixel scoreboard before that.

Feel free to reopen if you still see the issue.
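
The ordering described in the commit message above can be sketched with a toy model. This is not the Mesa code: the flag names, the command tuples and both functions are hypothetical, and the model only captures the rule "stall on the pixel scoreboard before disabling the indirect state pointers".

```python
# Hypothetical flag names for this sketch; they are not Mesa identifiers.
STALL_AT_SCOREBOARD = "stall_at_pixel_scoreboard"
ISP_DISABLE = "indirect_state_pointers_disable"


def finish_batch(commands):
    """Return a new command list that ends the batch in the fixed order."""
    return commands + [
        ("PIPE_CONTROL", {STALL_AT_SCOREBOARD}),
        ("PIPE_CONTROL", {ISP_DISABLE}),
        ("MI_BATCH_BUFFER_END", set()),
    ]


def isp_disable_is_safe(commands):
    """True if every ISP disable comes after a separate scoreboard stall
    that was issued later than the most recent 3DPRIMITIVE."""
    pending_draw = False
    for name, flags in commands:
        if name == "3DPRIMITIVE":
            pending_draw = True
        elif name == "PIPE_CONTROL":
            # Toy-model choice: the stall must be an earlier command, so the
            # ISP disable is checked before this command's own stall counts.
            if ISP_DISABLE in flags and pending_draw:
                return False
            if STALL_AT_SCOREBOARD in flags:
                pending_draw = False
    return True


old_batch = [
    ("3DPRIMITIVE", set()),
    ("PIPE_CONTROL", {ISP_DISABLE}),
    ("MI_BATCH_BUFFER_END", set()),
]
fixed_batch = finish_batch([("3DPRIMITIVE", set())])
print(isp_disable_is_safe(old_batch), isp_disable_is_safe(fixed_batch))
```

In the model, the old batch invalidates the pointers while a draw may still be running, and the fixed batch does not.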
Comment 12 Gary 2018-05-11 03:57:13 UTC
Created attachment 139486 [details]
drm-tip dmesg & error with patch 139248

Attached is an error and dmesg with the patch. 

I failed compiling mesa directly (autogen.sh something about Radeon). I used apt-get to build-dep, download source (patching it), and compile the Oibaf (Git) package for mesa. I installed the resulting deb files. I could have done it incorrectly, and look forward to the patch in mesa. I'm curious if this could be a hardware issue.
Comment 13 Lionel Landwerlin 2018-05-11 08:42:58 UTC
Thanks a lot for the new error state.
I can tell from the error state that you applied the patch attached.

Could you try with the master branch of mesa?
The patch we ended up pushing there is slightly different than the one I attached on this bug.

Regarding your question about the hardware issue, the error state indicates that it's a pagefault error, which we're pretty sure is a programming issue.
I think this is because we don't synchronize the invalidation of a set of pointers with the previous rendering.
Comment 14 Gary 2018-05-15 23:49:45 UTC
Created attachment 139583 [details]
Hang with v4 patch

New mesa version installed with the v4 patch, and hang. I tried re-patching and failed because it's already in oibaf's git-based deb files, excellent.

Attaching a dmesg and error log with it. Similar hang, but pixelized noise at the bottom of the display. Also, it didn't reoccur on restart as it usually does. (DP-1 monitor has a corrupt EDID I'm aware of, the hang happens with or without it)
Comment 15 Lionel Landwerlin 2018-05-16 11:02:01 UTC
Interesting. It's not hanging at the end of the batchbuffer anymore.
Looks like we potentially fixed something and now we're running into another issue.

I've installed cinnamon on a Kabylake machine. How do you reproduce this bug mostly? Just using the menu at the bottom left of the screen?
Comment 16 Gary 2018-05-16 16:35:37 UTC
Yes, in Cinnamon (Mint and Fedora), anything that moves hangs quickly, like opening/mouseovering the menu, rightclicking the desktop, or mouseovering the panel-launcher-icons. Yet some boots there is no problem at all.

I installed "multicore cpu monitor", a dynamic/moving bargraph in the panel, makes it hang in seconds if it will hang anyway. Thanks again for the help
Comment 17 Lionel Landwerlin 2018-05-18 13:56:16 UTC
I haven't been able to reproduce over the past few days unfortunately.
Even with the "multicore cpu monitor" applet.
Comment 18 Gary 2018-05-27 06:24:20 UTC
I returned the laptop. The new hang remained irregular, usually coinciding with ethernet not responding, so I suspect a hardware failure. I am closing the bug.

Thank you all for your help, especially Mr. Landwerlin for the patch, patient assistance with it, and testing. The laptop is broken but I'm glad Mesa improved. I buy CPUs with Intel GPUs for Intel's excellent opensource/libre drivers and support and community exemplified here.
