Bug 84489 - [845G] GPU HANG: ecode 0:0x422b7fc1 - stuck on render ring
Summary: [845G] GPU HANG: ecode 0:0x422b7fc1 - stuck on render ring
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
 
Reported: 2014-09-29 18:59 UTC by Dmitry Gorbachev
Modified: 2016-10-07 10:44 UTC


Attachments
/sys/class/drm/card0/error (2.20 MB, text/plain)
2014-09-29 18:59 UTC, Dmitry Gorbachev
Xorg log (12.76 KB, text/plain)
2014-09-29 19:00 UTC, Dmitry Gorbachev

Description Dmitry Gorbachev 2014-09-29 18:59:26 UTC
Created attachment 107080
/sys/class/drm/card0/error

Kernel Linux 3.16.3-gnu; distribution Parabola GNU/Linux (xf86-video-intel 2.99.916-2, xorg-server-libre 1.16.0-6, mesa 10.2.8-1, libdrm 2.4.56-1).

From the log:

kernel: Linux agpgart interface v0.103
kernel: agpgart-intel 0000:00:00.0: Intel 845G Chipset
kernel: agpgart-intel 0000:00:00.0: detected gtt size: 131072K total, 131072K mappable
kernel: agpgart-intel 0000:00:00.0: detected 512K stolen memory
kernel: agpgart-intel 0000:00:00.0: AGP aperture is 128M @ 0xe0000000
kernel: [drm] Initialized drm 1.1.0 20060810
kernel: [drm] Memory usable by graphics device = 128M
kernel: [drm] Replacing VGA console driver
kernel: Console: switching to colour dummy device 80x25
kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
kernel: [drm] Driver supports precise vblank timestamp query.
kernel: i915 0000:00:02.0: BAR 6: can't assign [??? 0x00000000 flags 0x20000000] (bogus alignment)
kernel: [drm] failed to find VBIOS tables
kernel: vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
kernel: [drm] initialized overlay support
kernel: [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 255
kernel: Raw EDID:
[...]
kernel: [drm] Got external EDID base block and 0 extensions from "edid/edid.bin" for connector "VGA-1"
kernel: i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
kernel: i915 0000:00:02.0: registered panic notifier
kernel: [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

and:

kernel: [drm] stuck on render ring
kernel: [drm] GPU HANG: ecode 0:0x422b7fc1, in Xorg.bin [131], reason: Ring hung, action: reset
kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
kernel: [drm:i915_reset] *ERROR* Failed to reset chip: -19

and then (like bug 82095):

kernel: [drm:i9xx_set_fifo_underrun_reporting] *ERROR* pipe A underrun
[...]
kernel: [drm] GPU HANG: ecode -1:0x00000000, reason: Command parser error, iir 0x00008000, action: continue
kernel: i915: render error detected, EIR: 0x00000010
kernel: [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
kernel: [drm] GPU HANG: ecode -1:0x00000000, reason: Command parser error, iir 0x00008000, action: continue
kernel: i915: render error detected, EIR: 0x00000010

It happens at random.
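
For triage, the "GPU HANG" lines quoted above can be split into fields mechanically. What follows is a minimal Python sketch, not part of any i915 tooling: the regular expression is inferred only from the two hang lines in this report, the names HANG_RE, parse_hang and decode_kernel_errno are hypothetical, and the number before the colon in "ecode" is kept as a plain integer without interpreting it. The sketch also maps the -19 from "Failed to reset chip: -19" to its errno name.

```python
import errno
import re

# Pattern inferred from the two hang lines in this report (an assumption,
# not an official format description). The "in <comm> [<pid>]" and
# "iir <hex>" parts are optional because only one line carries each.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<prefix>-?\d+):(?P<ecode>0x[0-9a-fA-F]+)"
    r"(?:, in (?P<comm>\S+) \[(?P<pid>\d+)\])?"
    r", reason: (?P<reason>.*?)"
    r"(?:, iir (?P<iir>0x[0-9a-fA-F]+))?"
    r", action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return the fields of a 'GPU HANG' log line as a dict, or None."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["prefix"] = int(fields["prefix"])
    fields["ecode"] = int(fields["ecode"], 16)
    if fields["pid"] is not None:
        fields["pid"] = int(fields["pid"])
    if fields["iir"] is not None:
        fields["iir"] = int(fields["iir"], 16)
    return fields


def decode_kernel_errno(ret):
    """Map a negative kernel return value such as -19 to its errno name."""
    return errno.errorcode.get(-ret, "unknown")


if __name__ == "__main__":
    first = ("kernel: [drm] GPU HANG: ecode 0:0x422b7fc1, in Xorg.bin [131], "
             "reason: Ring hung, action: reset")
    print(parse_hang(first))
    print(decode_kernel_errno(-19))
```

Grouping hang lines by the parsed reason and action separates the initial "Ring hung" event from the later "Command parser error" repeats.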
Comment 1 Dmitry Gorbachev 2014-09-29 19:00:28 UTC
Created attachment 107081
Xorg log
Comment 2 Chris Wilson 2014-09-30 06:26:46 UTC
Should be fixed with

commit c4d69da167fa967749aeb70bc0e94a457e5d00c1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 8 14:25:41 2014 +0100

    drm/i915: Evict CS TLBs between batches
    
    Running igt, I was encountering the invalid TLB bug on my 845g, despite
    that it was using the CS workaround. Examining the w/a buffer in the
    error state, showed that the copy from the user batch into the
    workaround itself was suffering from the invalid TLB bug (the first
    cacheline was broken with the first two words reversed). Time to try a
    fresh approach. This extends the workaround to write into each page of
    our scratch buffer in order to overflow the TLB and evict the invalid
    entries. This could be refined to only do so after we update the GTT,
    but for simplicity, we do it before each batch.
    
    I suspect this supersedes our current workaround, but for safety keep
    doing both.
    
    v2: The magic number shall be 2.
    
    This doesn't conclusively prove that it is the mythical TLB bug we've
    been trying to workaround for so long, that it requires touching a number
    of pages to prevent the corruption indicates to me that it is TLB
    related, but the corruption (the reversed cacheline) is more subtle than
    a TLB bug, where we would expect it to read the wrong page entirely.
    
    Oh well, it prevents a reliable hang for me and so probably for others
    as well.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>

I believe.
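
The idea in the commit message, writing into each page of a scratch buffer so that stale TLB entries are pushed out, can be illustrated with a small CPU-side sketch in Python. This is only an analogy under stated assumptions: the real workaround is performed by the GPU through commands the driver emits, the 4096-byte page size and the 8-page buffer are illustrative values, and touch_every_page is a hypothetical name rather than a kernel function.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def touch_every_page(buf, page_size=PAGE_SIZE, value=0):
    """Write one byte at the start of every page of buf.

    Returns the number of pages touched. One write per page means each
    page needs its own address translation, which is what would fill a
    translation cache and displace older entries.
    """
    touched = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = value
        touched += 1
    return touched


if __name__ == "__main__":
    scratch = bytearray(8 * PAGE_SIZE)  # hypothetical scratch buffer
    print(touch_every_page(scratch, value=0xA5))
```

For the eviction to work, the number of pages touched has to exceed the number of entries the translation cache holds, so the scratch buffer size follows from the cache size and not the other way round.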
Comment 3 Dmitry Gorbachev 2014-11-04 18:35:06 UTC
Likely fixed, closing the report...
Comment 4 Jari Tahvanainen 2016-10-07 10:44:22 UTC
Closing resolved after a year.

