Summary: | [kbl] GPU HANG: 9:0:0x85dffffb, in Cinnamon | ||
---|---|---|---|
Product: | Mesa | Reporter: | Gary <garyfrombugzilla> |
Component: | Drivers/DRI/i965 | Assignee: | Gary <garyfrombugzilla> |
Status: | RESOLVED WORKSFORME | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | critical | ||
Priority: | high | CC: | intel-gfx-bugs, rafael.antognolli |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | KBL | i915 features: | GPU hang |
Attachments: |
error log
dmesg output with drm.debug=0xe
dmesg hanging output with drm.debug=0xe
drm-tip dmesg & error with drm.debug=0xe
i965: require post sync operation priot to ISP disable
i965: require post sync operation priot to ISP disable
drm-tip dmesg & error with patch 139248
Hang with v4 patch |
Could you provide a dmesg log from a boot with drm.debug=0xe? Could you also test the latest drm-tip (https://cgit.freedesktop.org/drm-tip)?

Bad count in PIPE_CONTROL
0x00002128: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x0000212c: 0x00100200: destination address
0x00002130: 0x00000000: immediate dword low
0x00002134: 0x00000000: immediate dword high
0x00002140: 0x05000000: MI_BATCH_BUFFER_END

Stuck in a final flush of the batch submitted by cinnamon, so the batch in question is a suspect.

I think this pipe control was added by:

commit e7ecc5e1600a9463f3f2fff9a9cdaa35c2f68c04
Author: Rafael Antognolli <rafael.antognolli@intel.com>
Date: Thu Jan 25 17:14:47 2018 -0800
i965: Emit PIPE_CONTROL with ISP bit on older platforms.
Emit it on all platforms since gen7.

The documentation seems to imply that we should have a post sync operation in this pipe control:

"Indirect State Pointers Disable
At the completion of the post-sync operation associated with this pipe control packet, the indirect state pointers in the hardware are considered invalid; the indirect pointers are not saved in the context. If any new indirect state commands are executed in the command stream while the pipe control is pending, the new indirect state commands are preserved."

The ROW_INSTDONE/SAMPLER_INSTDONE values seem to imply that the hardware isn't completely done working, so that might be why we hang?

Created attachment 139182 [details]
dmesg output with drm.debug=0xe
I added drm.debug=0xe and am attaching the dmesg log. I can add a crash log with drm.debug=0xe when it next hangs. In the meantime, I will learn to compile drm-tip. Thank you all for the attention to this bug.
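For readers wondering what drm.debug=0xe actually enables, the value is a bitmask of DRM debug categories. The sketch below decodes it; the bit names are taken from the kernel's description of the drm.debug module parameter as I understand it, and the helper name `decode_drm_debug` is made up for illustration. Older kernels may not define the higher bits.

```python
# Debug categories selectable through the drm.debug bitmask.
# Assumption: names and values follow the kernel's drm.debug parameter
# description; the higher bits only exist on newer kernels.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories set in `mask`, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    # 0xe = 0x2 | 0x4 | 0x8
    print(decode_drm_debug(0xE))  # ['DRIVER', 'KMS', 'PRIME']
```

So 0xe turns on driver, modesetting and PRIME messages while leaving the very verbose core messages off.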
Created attachment 139183 [details]
dmesg hanging output with drm.debug=0xe
Good news maybe, the laptop is in a hanging mood again. Attached is a crash dmesg with drm.debug=0xe, missed_breadcrumb and hangcheck info. I have not yet learned drm-tip. Now installing Linux 4.17-rc2 and builds of today's git checkouts of libdrm and mesa.
Created attachment 139242 [details]
drm-tip dmesg & error with drm.debug=0xe
Attaching dmesg & error using the drm-tip kernel (Ubuntu mainline build, April 18). Installing later builds failed. This hang was less immediate, and the screen flickered more. I will continue to update.
Based on the error you're seeing in /sys, I've come up with the attached patch. It seems to work on our CI, but since we haven't detected this before, we probably have a gap. Is there any way you could give it a try? Thanks a lot.

Created attachment 139247 [details] [review]
i965: require post sync operation priot to ISP disable

Created attachment 139248 [details] [review]
i965: require post sync operation priot to ISP disable

Apologies, I ran it through the CI, then made a small change and screwed up...

Pushed a fix in https://cgit.freedesktop.org/mesa/mesa/commit/?id=f536097f67521180dafd270b28ac9a852af9c141 :

commit f536097f67521180dafd270b28ac9a852af9c141
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date: Tue May 1 12:32:45 2018 +0100
i965: require pixel scoreboard stall prior to ISP disable
Invalidating the indirect state pointers might affect a previously scheduled & still running 3DPRIMITIVE (causing page fault). So stall on pixel scoreboard before that.

Feel free to reopen if you still see the issue.

Created attachment 139486 [details]
drm-tip dmesg & error with patch 139248
Attached is an error and dmesg with the patch.
I failed compiling mesa directly (autogen.sh something about Radeon). I used apt-get to build-dep, download source (patching it), and compile the Oibaf (Git) package for mesa. I installed the resulting deb files. I could have done it incorrectly, and look forward to the patch in mesa. I'm curious if this could be a hardware issue.
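To make the PIPE_CONTROL discussion earlier in this thread more concrete, here is a small sketch that decodes the two dwords quoted from the error state (0x7a000004 and 0x00100200). It rests on several assumptions that are not stated in the thread: that on this hardware generation the second dword holds flag bits (the decoder output labels it "destination address", which looks like an older layout), and that the bit positions match Mesa's i965 PIPE_CONTROL flag defines as I remember them. The function names are hypothetical.

```python
# Assumed flag bit positions in PIPE_CONTROL dword 1 (gen6+ layout).
PIPE_CONTROL_FLAGS = {
    1 << 1: "STALL_AT_SCOREBOARD",
    1 << 9: "INDIRECT_STATE_POINTERS_DISABLE",
    1 << 20: "CS_STALL",
}

# Assumed post-sync operation field: bits 15:14 of dword 1.
POST_SYNC_SHIFT = 14
POST_SYNC_OPS = {
    0: "none",
    1: "write immediate",
    2: "write depth count",
    3: "write timestamp",
}


def decode_header(dw0):
    """Split a 3D command header dword into its fields."""
    return {
        "command_type": (dw0 >> 29) & 0x7,
        "subtype": (dw0 >> 27) & 0x3,
        "opcode": (dw0 >> 24) & 0x7,
        "subopcode": (dw0 >> 16) & 0xFF,
        # The length field stores the total dword count minus two.
        "total_dwords": (dw0 & 0xFF) + 2,
    }


def decode_flags(dw1):
    """Return (known flag names, post-sync operation) for dword 1."""
    flags = [name for bit, name in sorted(PIPE_CONTROL_FLAGS.items()) if dw1 & bit]
    post_sync = POST_SYNC_OPS[(dw1 >> POST_SYNC_SHIFT) & 0x3]
    return flags, post_sync


if __name__ == "__main__":
    print(decode_header(0x7A000004))
    print(decode_flags(0x00100200))
```

Under those assumptions, the second dword has the ISP disable and CS stall bits set with no post-sync operation and no pixel scoreboard stall, which would be consistent with the reading of the documentation quoted above and with the fix that was pushed.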
Thanks a lot for the new error state. I can tell from the error state that you applied the patch attached. Could you try with the master branch of mesa? The patch we ended up pushing there is slightly different from the one I attached to this bug.

Regarding your question about the hardware issue, the error state indicates that it's a pagefault error, which we're pretty sure is a programming issue. I think this is because we don't synchronize the invalidation of a set of pointers with the previous rendering.

Created attachment 139583 [details]
Hang with v4 patch
New mesa version installed with the v4 patch, and hang. I tried re-patching and failed because it's already in oibaf's git-based deb files, excellent.
Attaching a dmesg and error log with it. Similar hang, but pixelized noise at the bottom of the display. Also, it didn't reoccur on restart as it usually does. (DP-1 monitor has a corrupt EDID I'm aware of, the hang happens with or without it)
Interesting. It's not hanging at the end of the batchbuffer anymore. Looks like we potentially fixed something and now we're running into another issue.

I've installed cinnamon on a Kabylake machine. How do you reproduce this bug mostly? Just using the menu at the bottom left of the screen?

Yes, in Cinnamon (Mint and Fedora), anything that moves hangs quickly, like opening/mouseovering the menu, rightclicking the desktop, or mouseovering the panel-launcher-icons. Yet some boots there is no problem at all. I installed "multicore cpu monitor", a dynamic/moving bargraph in the panel, which makes it hang in seconds if it will hang anyway. Thanks again for the help.

I haven't been able to reproduce over the past few days unfortunately. Even with the "multicore cpu monitor" applet.

I returned the laptop. The new hang remained irregular, usually coinciding with ethernet not responding, so I suspect a hardware failure. I am closing the bug. Thank you all for your help, especially Mr. Landwerlin for the patch, patient assistance with it, and testing. The laptop is broken but I'm glad Mesa improved. I buy CPUs with Intel GPUs for Intel's excellent opensource/libre drivers and support and community exemplified here. |
Created attachment 139119 [details]
error log

Occasional GPU hangs on a Clevo N130WU that then recur constantly, even after reboots, unless the laptop is shut down and left off for a while (?). The internal display freezes, sometimes a black jagged triangle is drawn across the top-right of the desktop, it recovers, then freezes again.

Intermittent and unreproducible. It happens when opening a menu or mousing over the panel, or when using Firefox or MPV in MATE (both use GPU acceleration, MATE doesn't?). Seen in x86_64 Mint 18.3 Cinnamon LiveUSB & Installed (and Fedora 27 Cinnamon LiveUSB).

Updated with Linux 4.16.3 and kbl_dmc_ver1_04 and ppa:oibaf/graphics-drivers (Mesa 18.2.0-devel, libdrm 2.4.91, xserver-xorg-video-intel uninstalled, etc).

GPU crash dumps attached. Unable to dump VBIOS: "cat: '/sys/devices/pci0000:00/0000:00:02.0/rom': Input/output error". There is no "stolen memory" setting in the BIOS (?).

I am not technically skilled but trying. Thank you.

Relevant dmesg messages:
Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
GPU HANG: ecode 9:0:0x85dffffb, in cinnamon [1735], reason: Hang on rcs0, action: reset
i915 0000:00:02.0: Resetting rcs0 after gpu hang
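When scanning a long dmesg log for hangs like the one quoted above, it can help to extract the fields of the GPU HANG line programmatically. The following is a minimal sketch that assumes the line format shown in this report; the group names `gen` and `engines` are my own guesses at what the first two ecode fields mean and are not confirmed anywhere in this thread.

```python
import re

# Pattern for the i915 hang message as it appears in this report.
# Assumption: the group names "gen" and "engines" are illustrative labels only.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-fA-F]+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG dmesg line, or None if it does not match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


if __name__ == "__main__":
    line = ("GPU HANG: ecode 9:0:0x85dffffb, in cinnamon [1735], "
            "reason: Hang on rcs0, action: reset")
    print(parse_gpu_hang(line))
```

Running the same function over every line of a saved dmesg file and collecting the results that are not None gives a quick list of which processes triggered hangs and on which engine.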