Bug 104545 - kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, reason: Hang on rcs0, action: reset
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-09 03:54 UTC by cfr
Modified: 2018-03-02 16:02 UTC
CC List: 1 user

See Also:
i915 platform:
i915 features:


Attachments
GPU crash dump from /sys/class/drm/card0/error as requested (16.41 KB, text/plain)
2018-01-09 03:54 UTC, cfr
no flags

Description cfr 2018-01-09 03:54:31 UTC
Created attachment 136624
GPU crash dump from /sys/class/drm/card0/error as requested

I'm filing this as a new bug, as instructed by the kernel messages written to the journal after a GPU hang.

Symptoms: The GPU sometimes hangs while the laptop is running on battery. Hangs occur only when the laptop is left awake without interaction for a short while (e.g. while making a cup of tea or popping to the loo). Even then they are unpredictable: most of the time these conditions do not produce a hang. When a hang does occur, the machine is unresponsive on return and cannot be put to sleep, woken, etc. The screen is blank (black). After a hard reset, messages such as the following can be found in the journal:

Jan 09 03:17:16 MyComputer kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, reason: Hang on rcs0, action: reset
Jan 09 03:17:16 MyComputer kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 09 03:17:16 MyComputer kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 09 03:17:16 MyComputer kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 09 03:17:16 MyComputer kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jan 09 03:17:16 MyComputer kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jan 09 03:17:16 MyComputer kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang

The last line is repeated many times. As the crash dump does not survive a reboot, I collected one by running

dmesg -w | awk '/GPU crash dump saved to \/sys\/class\/drm\/card0\/error/ {system("cat /sys/class/drm/card0/error | bzip2 > error.bz2")}'

as suggested at https://bbs.archlinux.org/viewtopic.php?pid=1753566#p1753566.
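A minimal sketch of the same capture loop as reusable bash functions, assuming it runs as root (reading the error file, and on many systems dmesg, requires it). Unlike the one-liner, it writes each captured error state to a new numbered, timestamped file instead of overwriting error.bz2. The variable names ERROR_FILE, OUT_DIR and COMPRESS and the function names are hypothetical, chosen for illustration only.

```shell
# Requires bash. Sketch only: ERROR_FILE, OUT_DIR, COMPRESS and the function
# names are illustrative, not part of any i915 tooling.
ERROR_FILE=${ERROR_FILE:-/sys/class/drm/card0/error}  # i915 error state
OUT_DIR=${OUT_DIR:-.}                                 # where dumps are saved
COMPRESS=${COMPRESS:-bzip2}   # with no file arguments, bzip2 filters stdin to stdout
CAPTURE_COUNT=0

# Copy the current error state into a new, uniquely named file.
save_error_state() {
  CAPTURE_COUNT=$((CAPTURE_COUNT + 1))
  local dest="$OUT_DIR/i915-error-$(date +%Y%m%dT%H%M%S)-$CAPTURE_COUNT.bz2"
  "$COMPRESS" < "$ERROR_FILE" > "$dest" && echo "saved $dest"
}

# Read kernel log lines on stdin and save a dump each time i915 reports one.
watch_for_hangs() {
  local line
  while IFS= read -r line; do
    case $line in
      *"GPU crash dump saved to"*) save_error_state ;;
    esac
  done
}
```

Usage, as root: source the functions, then run dmesg -w | watch_for_hangs and leave it running until the next hang.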

Since the dump is not especially large, I'm attaching the decompressed version. 

I'm happy to provide further information on request, provided I can work out how to collect it.
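Since the journal survives the hard reset (which requires persistent journal storage), the hang and reset messages can be tallied afterwards. A minimal sketch with a hypothetical function name: it reads kernel log text on stdin and counts the GPU HANG reports and the "Resetting <engine> after gpu hang" lines per engine, relying only on the message wording quoted above.

```shell
# Requires bash and awk. summarize_gpu_hangs is an illustrative helper name.
# Prints the number of "GPU HANG" reports, then one resets[<engine>]=<count>
# line per engine (engine order is unspecified).
summarize_gpu_hangs() {
  awk '
    /\[drm\] GPU HANG:/ { hangs++ }
    /Resetting [a-z0-9]+ after gpu hang/ {
      # The engine name (e.g. rcs0) is the field right after "Resetting".
      for (i = 1; i < NF; i++)
        if ($i == "Resetting") resets[$(i + 1)]++
    }
    END {
      printf "hangs=%d\n", hangs
      for (e in resets) printf "resets[%s]=%d\n", e, resets[e]
    }'
}
```

For example, journalctl -k -b -1 | summarize_gpu_hangs summarises the kernel messages from the previous boot, i.e. the one that ended in the hard reset.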
Comment 1 Chris Wilson 2018-01-09 09:00:41 UTC
The DMC firmware is broken; see

commit 4f0aa1fa3e3849caee450ee5d14fcc289cf16703
Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
Date:   Thu Nov 9 10:51:43 2017 -0800

    drm/i915/dmc: DMC 1.04 for Kabylake
    
    There is a new version of DMC available for KBL.
    
    The release notes mention:
    1. Fix for the issue where DC_STATE was getting enabled even
    when disabled by driver causing data corruption.
    
    v2: Remove pull request from commit message (Rodrigo).
    
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Anusha Srivatsa <anusha.srivatsa@intel.com>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/1510253503-12634-1-git-send-email-anusha.srivatsa@intel.com
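To check whether a machine is already running the DMC 1.04 firmware for Kabylake, the version the driver loaded can be read back from the kernel log. A minimal sketch, assuming i915 logs a line of the form "[drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)"; the exact wording may differ between kernel versions, and the function names are hypothetical.

```shell
# Requires bash. Function names are illustrative. Assumes i915 logs a line like
#   [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
# which may be worded differently on other kernel versions.

# Print "<firmware path> <major.minor>" for every DMC load message on stdin.
dmc_firmware_version() {
  sed -n 's/.*Finished loading DMC firmware \([^ ]*\) (v\([0-9]*\.[0-9]*\)).*/\1 \2/p'
}

# Succeed if the most recently loaded DMC firmware is at least MAJOR.MINOR.
# Returns 2 if no DMC load message is found.
dmc_is_at_least() {
  local want_major=$1 want_minor=$2 found ver major minor
  found=$(dmc_firmware_version | tail -n 1)
  [ -n "$found" ] || return 2
  ver=${found##* }      # keep the version after the last space
  major=${ver%%.*}
  minor=${ver#*.}
  (( major > want_major || (major == want_major && minor >= want_minor) ))
}
```

For example, journalctl -k -b | dmc_is_at_least 1 4 && echo "DMC 1.04 or newer loaded" checks the current boot.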

