Bug 84490 - [snb] GPU HANG: ecode 0:0x85fffff8, in kwin [3405], reason: Ring hung, action: reset
Status: RESOLVED INVALID
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Ian Romanick
QA Contact: Intel 3D Bugs Mailing List
Reported: 2014-09-29 20:10 UTC by chnotley
Modified: 2017-02-10 22:38 UTC

Attachments
GPU crash dump (2.19 MB, text/plain)
2014-09-29 20:10 UTC, chnotley

Description chnotley 2014-09-29 20:10:45 UTC
Created attachment 107089
GPU crash dump

I upgraded the kernel on a stock Debian Wheezy system to a vanilla 3.16.3 downloaded from kernel.org, and the system became slow to boot.  My logs showed a 3-minute hang before this message was printed:

Sep 29 12:57:16 notleych-linux org.kde.powerdevil.backlighthelper: QDBusConnection: system D-Bus connection created before QCoreApplication. Application may misbehave.
Sep 29 12:57:22 notleych-linux kernel: [  198.748705] [drm] stuck on render ring
Sep 29 12:57:22 notleych-linux kernel: [  198.749260] [drm] GPU HANG: ecode 0:0x85fffff8, in kwin [3405], reason: Ring hung, action: reset
Sep 29 12:57:22 notleych-linux kernel: [  198.749263] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 29 12:57:22 notleych-linux kernel: [  198.749264] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 29 12:57:22 notleych-linux kernel: [  198.749265] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 29 12:57:22 notleych-linux kernel: [  198.749266] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 29 12:57:22 notleych-linux kernel: [  198.749267] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 29 12:57:22 notleych-linux kernel: [  198.768720] ------------[ cut here ]------------
Sep 29 12:57:22 notleych-linux kernel: [  198.768733] WARNING: CPU: 1 PID: 3008 at drivers/gpu/drm/drm_irq.c:774 send_vblank_event+0x32/0xce [drm]()
Sep 29 12:57:22 notleych-linux kernel: [  198.768734] Modules linked in: des_generic ecb md4 cifs binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc nls_utf8 nls_cp437 vfat fat loop joydev hid_generic usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper cryptd i915 lrw gf128mul glue_helper drm_kms_helper drm iTCO_wdt iTCO_vendor_support ehci_pci ehci_hcd usbcore acpi_cpufreq evdev lpc_ich psmouse usb_common i2c_i801 mfd_core serio_raw processor i2c_algo_bit microcode i2c_core snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic pcspkr dcdbas snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm video button snd_seq snd_timer snd_seq_device parport_pc parport thermal_sys snd soundcore tpm_tis tpm ext4 crc16 jbd2 mbcache raid1 md_mod sg sd_mod crc_t10dif crct10dif_common sr_mod cdrom crc32c_intel ahci libahci libata scsi_mod e1000e ptp pps_core
Sep 29 12:57:22 notleych-linux kernel: [  198.768775] CPU: 1 PID: 3008 Comm: Xorg Not tainted 3.16.3 #1
Sep 29 12:57:22 notleych-linux kernel: [  198.768776] Hardware name: Dell Inc. Precision T1600/06NWYK, BIOS A10 02/21/2012
Sep 29 12:57:22 notleych-linux kernel: [  198.768777]  0000000000000000 0000000000000009 ffffffff8139ae68 0000000000000000
Sep 29 12:57:22 notleych-linux kernel: [  198.768779]  ffffffff8103c193 ffff88030b89f000 ffffffffa032a6f5 ffff88031c85eb70
Sep 29 12:57:22 notleych-linux kernel: [  198.768781]  ffff88002fb0be40 ffff88030bbe7cd8 000000000000027e ffff88031c85e800
Sep 29 12:57:22 notleych-linux kernel: [  198.768783] Call Trace:
Sep 29 12:57:22 notleych-linux kernel: [  198.768788]  [<ffffffff8139ae68>] ? dump_stack+0x41/0x51
Sep 29 12:57:22 notleych-linux kernel: [  198.768792]  [<ffffffff8103c193>] ? warn_slowpath_common+0x78/0x90
Sep 29 12:57:22 notleych-linux kernel: [  198.768797]  [<ffffffffa032a6f5>] ? send_vblank_event+0x32/0xce [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768801]  [<ffffffffa032a6f5>] ? send_vblank_event+0x32/0xce [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768805]  [<ffffffffa032aa75>] ? drm_send_vblank_event+0x51/0x5a [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768818]  [<ffffffffa03a03e4>] ? intel_crtc_page_flip+0x3b2/0x3eb [i915]
Sep 29 12:57:22 notleych-linux kernel: [  198.768824]  [<ffffffffa03351bd>] ? drm_mode_page_flip_ioctl+0x1dc/0x27d [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768828]  [<ffffffffa0327fe0>] ? drm_ioctl+0x27a/0x3c0 [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768833]  [<ffffffffa0334fe1>] ? drm_mode_gamma_get_ioctl+0xb7/0xb7 [drm]
Sep 29 12:57:22 notleych-linux kernel: [  198.768836]  [<ffffffff811223b8>] ? do_vfs_ioctl+0x3ed/0x436
Sep 29 12:57:22 notleych-linux kernel: [  198.768839]  [<ffffffff8111522a>] ? vfs_read+0xb7/0xf7
Sep 29 12:57:22 notleych-linux kernel: [  198.768841]  [<ffffffff8112244a>] ? SyS_ioctl+0x49/0x77
Sep 29 12:57:22 notleych-linux kernel: [  198.768843]  [<ffffffff8139f312>] ? system_call_fastpath+0x16/0x1b
Sep 29 12:57:22 notleych-linux kernel: [  198.768844] ---[ end trace 4e4656dbeea8452e ]---
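
For anyone triaging logs like the one above, the fields of the "GPU HANG" line can be pulled out with a short script. This is only a minimal sketch: the pattern is written against the exact line format shown in this report (other kernel versions may print it differently), and the names `HANG_RE` and `parse_gpu_hang` are hypothetical helpers, not part of any i915 or Mesa tooling. The number before the colon in the ecode is captured as `prefix` without interpreting it.

```python
import re

# Pattern matching the hang line as it appears in this report's log.
# Assumption: the format is "ecode <n>:<hex>, in <comm> [<pid>], reason: ..., action: ...".
HANG_RE = re.compile(
    r"\[drm\] GPU HANG: ecode (?P<prefix>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of hang fields, or None if the line is not a hang line."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])  # the PID is the only numeric field converted
    return info


if __name__ == "__main__":
    sample = (
        "Sep 29 12:57:22 notleych-linux kernel: [  198.749260] [drm] GPU HANG: "
        "ecode 0:0x85fffff8, in kwin [3405], reason: Ring hung, action: reset"
    )
    print(parse_gpu_hang(sample))
```

Lines that do not contain a hang message return None, so the function can be applied to every line of a syslog file and the non-matches filtered out.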
Comment 1 Chris Wilson 2014-09-30 06:01:55 UTC
Should be fixed with

commit c4d69da167fa967749aeb70bc0e94a457e5d00c1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 8 14:25:41 2014 +0100

    drm/i915: Evict CS TLBs between batches
    
    Running igt, I was encountering the invalid TLB bug on my 845g, despite
    that it was using the CS workaround. Examining the w/a buffer in the
    error state, showed that the copy from the user batch into the
    workaround itself was suffering from the invalid TLB bug (the first
    cacheline was broken with the first two words reversed). Time to try a
    fresh approach. This extends the workaround to write into each page of
    our scratch buffer in order to overflow the TLB and evict the invalid
    entries. This could be refined to only do so after we update the GTT,
    but for simplicity, we do it before each batch.
    
    I suspect this supersedes our current workaround, but for safety keep
    doing both.
    
    v2: The magic number shall be 2.
    
    This doesn't conclusively prove that it is the mythical TLB bug we've
    been trying to workaround for so long, that it requires touching a number
    of pages to prevent the corruption indicates to me that it is TLB
    related, but the corruption (the reversed cacheline) is more subtle than
    a TLB bug, where we would expect it to read the wrong page entirely.
    
    Oh well, it prevents a reliable hang for me and so probably for others
    as well.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>

I believe.
Comment 2 Chris Wilson 2014-09-30 06:26:26 UTC
(In reply to comment #1)
> Should be fixed with
> 
> commit c4d69da167fa967749aeb70bc0e94a457e5d00c1
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Sep 8 14:25:41 2014 +0100
> 
>     drm/i915: Evict CS TLBs between batches
>     

Ah, replied to the wrong bug. Sorry.
Comment 3 Annie 2017-02-10 22:38:56 UTC
Dear Reporter,

This Mesa bug has been in the "NEEDINFO" status for over 60 days. I am closing this bug based on lack of response but feel free to reopen if resolution is still needed. Please ensure you're supplying the correct information as requested.

Thank you.