Bug 110985

Summary: [hsw] GPU HANG: ecode 7:1:0xfffffffe in chromium
Product: DRI
Reporter: Bernhard Rosenkraenzer <bero>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED DUPLICATE
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: minor
Priority: medium
CC: intel-gfx-bugs
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: Triaged
i915 platform: HSW
i915 features: GPU hang

Description Bernhard Rosenkraenzer 2019-06-24 21:27:25 UTC
I haven't seen any bad effects from this, but dmesg says my GPU froze and was reset.

[Mon Jun 24 02:01:15 2019] i915 0000:00:02.0: GPU HANG: ecode 7:1:0xfffffffe, in Chrome_InProcGp [2585], hang on rcs0
[Mon Jun 24 02:01:15 2019] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Mon Jun 24 02:01:15 2019] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Mon Jun 24 02:01:15 2019] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Mon Jun 24 02:01:15 2019] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[Mon Jun 24 02:01:15 2019] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[Mon Jun 24 02:01:15 2019] i915 0000:00:02.0: Resetting chip for hang on rcs0
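
For triage, the "GPU HANG" summary line above can be split into its parts with a small script. This is a minimal sketch, not a kernel tool: the function name and dictionary keys are made up for illustration, and reading the three ecode fields as GPU generation, engine mask and a hash-like code is an assumption (it is at least consistent with Haswell being gen7).

```python
import re

# Matches the summary line as it appears in this report. The field names
# (gen, engines, code) are an assumed interpretation, not taken from the kernel.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-fA-F]+):(?P<code>0x[0-9a-fA-F]+)"
    r"(?:, in (?P<comm>.+?) \[(?P<pid>\d+)\])?"
    r"(?:, hang on (?P<engine>\w+))?"
)

def parse_hang_line(line):
    """Return the fields of a 'GPU HANG' line as a dict, or None if absent."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    return {
        "gen": int(m.group("gen")),
        "engine_mask": int(m.group("engines"), 16),
        "code": int(m.group("code"), 16),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")) if m.group("pid") else None,
        "engine": m.group("engine"),
    }

if __name__ == "__main__":
    sample = ("[Mon Jun 24 02:01:15 2019] i915 0000:00:02.0: GPU HANG: "
              "ecode 7:1:0xfffffffe, in Chrome_InProcGp [2585], hang on rcs0")
    print(parse_hang_line(sample))
```

Feeding it the output of dmesg line by line and keeping the non-None results gives one record per hang.
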

/sys/class/drm/card0/error says:
GPU HANG: ecode 7:1:0xfffffffe, in Chrome_InProcGp [2585], hang on rcs0
Kernel: 5.2.0-desktop-0.rc4.1omv4000 x86_64
Time: 1561334476 s 640126 us
Boottime: 797836 s 857659 us
Uptime: 797835 s 261171 us
Epoch: 5092498048 jiffies (1000 HZ)
Capture: 5092504000 jiffies; 76939406 ms ago, 5952 ms after epoch
Active process (on ring rcs0): Chrome_InProcGp [2585]
Reset count: 0
Suspend count: 0
Platform: HASWELL
Subplatform: 0x0
PCI ID: 0x0412
PCI Revision: 0x06
PCI Subsystem: 1028:05b7
IOMMU enabled?: 0
GT awake: yes
RPM wakelock: yes
PM suspended: no
EIR: 0x00000000
IER: 0xfc080421
GTIER[0]: 0x00401821
PGTBL_ER: 0x00000000
FORCEWAKE: 0x00000001
DERRMR: 0xffffffff
CCID: 0x0412010d
  fence[0] = f83b03b0f035001
  fence[1] = 00000000
  fence[2] = f03403b0e82e001
  fence[3] = e82d03b0e027001
  fence[4] = e02603b0d820001
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000
  fence[14] = 00000000
  fence[15] = 00000000
  fence[16] = 00000000
  fence[17] = 00000000
  fence[18] = 00000000
  fence[19] = 00000000
  fence[20] = 00000000
  fence[21] = 00000000
  fence[22] = 00000000
  fence[23] = 00000000
  fence[24] = 00000000
  fence[25] = 00000000
  fence[26] = 00000000
  fence[27] = 00000000
  fence[28] = 00000000
  fence[29] = 00000000
  fence[30] = 00000000
  fence[31] = 00000000
ERROR: 0x00000000
DONE_REG: 0xffffffff
ERR_INT: 0x00000000
rcs0 command stream:
  IDLE?: no
  START: 0x00301000
  HEAD:  0x42a1f930 [0x0001f828]
  TAIL:  0x0001fac8 [0x0001f8e0, 0x0001f908]
  CTL:   0x0001f001
  MODE:  0x00004000
  HWS:   0x7fffe000
  ACTHD: 0x00000000 42a1f930
  IPEIR: 0x00000000
  IPEHR: 0x0c000000
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  BBADDR: 0x00000000_00003e70
  BB_STATE: 0x00000000
  INSTPS: 0x8000010b
  INSTPM: 0x00000080
  FADDR: 0x00000000 00320ac0
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00002a00
  PP_DIR_BASE: 0x02380000
  ring->head: 0x0001f800
  ring->tail: 0x0001fac8
  hangcheck timestamp: 0ms (5092498048; epoch)
  engine reset count: 0
  Active context: Chrome_InProcGp[2585] hw_id 0, prio 0, guilty 0 active 0
bcs0 command stream:
  IDLE?: yes
  START: 0x00321000
  HEAD:  0x0de18d58 [0x00000000]
  TAIL:  0x00018d58 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7ffec000
  ACTHD: 0x00000000 0de18d58
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_00b6a020
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00339d58
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x7fde0000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: -797530752ms (0)
  engine reset count: 0
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0
vcs0 command stream:
  IDLE?: yes
  START: 0x00341000
  HEAD:  0x00000420 [0x00000000]
  TAIL:  0x00000420 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7ffeb000
  ACTHD: 0x00000000 00000420
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00341420
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x7fde0000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: -797530752ms (0)
  engine reset count: 0
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0
vecs0 command stream:
  IDLE?: yes
  START: 0x00361000
  HEAD:  0x00000420 [0x00000000]
  TAIL:  0x00000420 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7ffea000
  ACTHD: 0x00000000 00000420
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00361420
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x7fde0000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: -797530752ms (0)
  engine reset count: 0
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0
Active (rcs0) [8]:
    00000000_02707000    16384 3f 00 dirty purgeable LLC
    00000000_005d7000    16384 3e 02 Y dirty LLC
    00000000_00000000     4096 3e 02 dirty LLC
    00000000_00b46000    16384 3f 00 dirty LLC
    00000000_0125c000    32768 3f 00 Y dirty LLC
    00000000_0117d000     4096 3f 00 dirty LLC
    00000000_005e6000    65536 3f 00 dirty LLC
    00000000_00f03000    20480 3f 00 dirty LLC
Pinned (global) [17]:
    00000000_7ffff000     4096 41 00 LLC
    00000000_7fffe000     4096 01 01 purgeable LLC
    00000000_00301000   131072 41 00 LLC
    00000000_7ffed000    69632 01 01 dirty LLC
    00000000_7ffec000     4096 01 01 purgeable LLC
    00000000_00321000   131072 41 00 LLC
    00000000_7ffeb000     4096 01 01 purgeable LLC
    00000000_00341000   131072 41 00 LLC
    00000000_7ffea000     4096 01 01 purgeable LLC
    00000000_00361000   131072 41 00 LLC
    00000000_00381000  8294400 7f 00 dirty uncached
    00000000_03020000    69632 01 01 dirty LLC
    00000000_04120000    69632 01 01 dirty LLC
    00000000_03f00000    69632 01 01 dirty LLC
    00000000_00220000   262144 41 00 uncached
    00000000_0f035000  8388608 7e 00 X dirty uncached (fence: 0)
    00000000_0e027000  8388608 7e 00 X dirty uncached (fence: 3)
rcs0 --- 3 requests
  pid 2585, seqno        6:004e1680+, prio -2147483648, emitted -932ms, start 00301000, head 0001f828, tail 0001f908
  pid 6448, seqno        6:004e1681+, prio -2147483648, emitted -899ms, start 00301000, head 0001f908, tail 0001f9e8
  pid 6448, seqno        6:004e1682, prio -2147483648, emitted -400ms, start 00301000, head 0001f9e8, tail 0001fac8
Num Pipes: 3
PWR_WELL_CTL2: c0000000
Pipe [0]:
  Power: on
  SRC: 077f0437
  STAT: 00000000
Plane [0]:
  CNTR: d9000400
  STRIDE: 00001e00
  SURF: 0f035000
  TILEOFF: 00000000
Cursor [0]:
  CNTR: 05000023
  POS: 01eb020a
  BASE: 00220000
Pipe [1]:
  Power: on
  SRC: 00000000
  STAT: 00000000
Plane [1]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [1]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [2]:
  Power: on
  SRC: 00000000
  STAT: 00000000
Plane [2]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [2]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
CPU transcoder: A
  Power: on
  CONF: c0000000
  HTOTAL: 0897077f
  HBLANK: 0897077f
  HSYNC: 080307d7
  VTOTAL: 04640437
  VBLANK: 04640437
  VSYNC: 0440043b
CPU transcoder: B
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: C
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: EDP
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
is_mobile: no
is_lp: no
is_alpha_support: no
has_64bit_reloc: no
gpu_reset_clobbers_display: no
has_reset_engine: no
has_fpga_dbg: yes
has_guc: no
has_guc_ct: no
has_l3_dpf: yes
has_llc: yes
has_logical_ring_contexts: no
has_logical_ring_elsq: no
has_logical_ring_preemption: no
has_pooled_eu: no
has_rc6: yes
has_rc6p: no
has_runtime_pm: yes
has_snoop: no
has_coherent_ggtt: yes
unfenced_needs_alignment: no
hws_needs_physical: no
cursor_needs_physical: no
has_csr: no
has_ddi: yes
has_dp_mst: yes
has_fbc: yes
has_gmch: no
has_hotplug: yes
has_ipc: no
has_overlay: no
has_psr: yes
overlay_needs_physical: no
supports_tv: no
Has logical contexts? yes
scheduler: 0
slice0: 2 subslice(s) (0x3):
        subslice0: 10 EUs (0x3ff)
        subslice1: 10 EUs (0x3ff)
i915.vbt_firmware=(null)
i915.modeset=-1
i915.lvds_channel_mode=0
i915.panel_use_ssc=-1
i915.vbt_sdvo_panel_type=-1
i915.enable_dc=-1
i915.enable_fbc=0
i915.enable_psr=-1
i915.disable_power_well=1
i915.enable_ips=1
i915.invert_brightness=0
i915.enable_guc=0
i915.guc_log_level=0
i915.guc_firmware_path=(null)
i915.huc_firmware_path=(null)
i915.dmc_firmware_path=(null)
i915.mmio_debug=1
i915.edp_vswing=0
i915.reset=2
i915.inject_load_failure=0
i915.fastboot=-1
i915.alpha_support=yes
i915.enable_hangcheck=yes
i915.prefault_disable=no
i915.load_detect_test=no
i915.force_reset_modeset_test=no
i915.error_capture=yes
i915.disable_display=no
i915.verbose_state_checks=yes
i915.nuclear_pageflip=no
i915.enable_dp_mst=yes
i915.enable_dpcd_backlight=no
i915.enable_gvt=no
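
In the dump above only rcs0 reports "IDLE?: no". A rough way to pick out the busy engines from a saved copy of the error state is sketched below. The helper name is hypothetical, and the parsing assumes the "<engine> command stream:" and "IDLE?:" layout shown in this report, which may differ between kernel versions.

```python
import re

def busy_engines(dump_text):
    """Return names of engines whose section contains 'IDLE?: no'."""
    busy = []
    current = None
    for line in dump_text.splitlines():
        # Section headers look like "rcs0 command stream:" in this dump.
        m = re.match(r"^(\w+) command stream:", line)
        if m:
            current = m.group(1)
            continue
        stripped = line.strip()
        if current and stripped.startswith("IDLE?:"):
            if stripped.split(":", 1)[1].strip() == "no":
                busy.append(current)
    return busy

if __name__ == "__main__":
    sample = (
        "rcs0 command stream:\n"
        "  IDLE?: no\n"
        "bcs0 command stream:\n"
        "  IDLE?: yes\n"
    )
    print(busy_engines(sample))
```
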
Comment 1 Chris Wilson 2019-06-25 15:48:04 UTC
  IPEHR: 0x0c000000 => fault on context load. 

Could be a bad context image, or more likely we've angered the GPU.
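
For readers wondering how IPEHR 0x0c000000 maps to a context load: a sketch of the decode, assuming the usual MI command header layout (command type in bits 31:29, opcode in bits 28:23) and the i915 definition of MI_SET_CONTEXT as MI_INSTR(0x18, 0). The function name is illustrative only.

```python
# Opcode of MI_SET_CONTEXT, per the i915 macro MI_INSTR(0x18, 0),
# where MI_INSTR(opcode, flags) is (opcode << 23) | flags.
MI_SET_CONTEXT_OPCODE = 0x18

def decode_mi_header(ipehr):
    """Split a command header dword into (command type, MI opcode).

    Assumes the MI layout: type in bits 31:29, opcode in bits 28:23.
    """
    cmd_type = (ipehr >> 29) & 0x7
    opcode = (ipehr >> 23) & 0x3F
    return cmd_type, opcode

if __name__ == "__main__":
    cmd_type, opcode = decode_mi_header(0x0C000000)
    # Command type 0 is the MI (memory interface) command group.
    is_set_context = cmd_type == 0 and opcode == MI_SET_CONTEXT_OPCODE
    print(hex(opcode), is_set_context)
```

With the value from this dump, the opcode field comes out as 0x18, which is why the hang is read as occurring during the context switch instruction.
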
Comment 2 Chris Wilson 2019-07-02 19:22:52 UTC
Please test with

commit c84c9029d782a3a0d2a7f0522ecb907314d43e2c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 19 12:17:47 2019 +0100

    drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context

heading to v5.1 via stable in the next few weeks.

*** This bug has been marked as a duplicate of bug 111014 ***
