112341 – [cfl iommu] GPU Hang with Gallium3D iris driver on FC31

Summary: [cfl iommu] GPU Hang with Gallium3D iris driver on FC31

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	not set normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2019-11-19 17:49 UTC by ryan
Modified:	2019-11-29 19:50 UTC
CC List:	1 user

See Also:
i915 platform:	CFL
i915 features:	GPU hang

Attachments
GPU crash dump (21.00 KB, text/plain) 2019-11-19 17:49 UTC, ryan	no flags
Another GPU hang crash output (23.67 KB, text/plain) 2019-11-21 11:44 UTC, ryan	no flags

Description ryan 2019-11-19 17:49:51 UTC

Created attachment 146000
GPU crash dump

As per dmesg:

[ 1156.640672] DMAR: DRHD: handling fault status reg 3
[ 1156.640677] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr fffffffeffef6000 [fault reason 07] Next page table ptr is invalid
[ 1162.782665] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x85dffffb, in alacritty [4420], hang on rcs0
[ 1162.782667] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1162.782667] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1162.782667] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1162.782667] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 1162.782668] GPU crash dump saved to /sys/class/drm/card0/error
[ 1162.783676] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

ThinkPad X1Y4 (i7-8665U/UHD 620) with kernel 5.4-rc8, Mesa 19.3-rc3 and MESA_LOADER_DRIVER_OVERRIDE=iris.

Crash dump attached.
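
For anyone comparing several of these reports, here is a minimal Python sketch that pulls the fields out of the DMAR fault line quoted above. The regular expression and the `parse_dmar_fault` helper are illustrative assumptions written against this exact line format; they are not part of any kernel or i915 tooling, and other kernel versions may word the message differently.

```python
import re

# Pattern written for the exact DMAR line quoted in this report (assumption:
# other kernels may format the message differently).
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"PASID (?P<pasid>[0-9a-f]+) fault addr (?P<addr>[0-9a-f]+) "
    r"\[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_dmar_fault(line):
    """Return the fields of a DMAR fault line as a dict, or None if no match."""
    m = DMAR_FAULT.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),                # "Read" or "Write"
        "device": m.group("dev"),           # PCI device, e.g. "00:02.0"
        "addr": int(m.group("addr"), 16),   # faulting address as an integer
        "reason": int(m.group("reason")),   # numeric fault reason
        "text": m.group("text").strip(),    # human-readable reason
    }


line = (
    "[ 1156.640677] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff "
    "fault addr fffffffeffef6000 [fault reason 07] Next page table ptr is invalid"
)
fault = parse_dmar_fault(line)
print(fault["device"], hex(fault["addr"]), fault["reason"], fault["text"])
```

Device 00:02.0 is the same PCI device that i915 reports in the GPU HANG line, which is why the fault and the hang are treated as related here.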

Comment 1 Chris Wilson 2019-11-19 18:47:57 UTC

The DMAR error right before is definitely suspect. The odd part with DMAR is that there are no unmapped GPU addresses...

Comment 2 Lakshmi 2019-11-20 10:28:14 UTC

rcs0 command stream:
  IDLE?: no
  START: 0x0615c000
  HEAD:  0x00201e38 [0x00001de0]
  TAIL:  0x00001e90 [0x00001e40, 0x00001e90]
  CTL:   0x00003001
  MODE:  0x00000000
  HWS:   0xffffe000
  ACTHD: 0x0000fffe fffd7b0c
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffff90
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xfe10ffbc
  ROW_INSTDONE[0][1]: 0xfe10ffbc
  ROW_INSTDONE[0][2]: 0xfe10ffbc
  batch: [0x0000fffe_fffd7000, 0x0000fffe_fffeb000]
  BBADDR: 0x0000fffe_fffd7b0d
  BB_STATE: 0x00000020
  INSTPS: 0x00009080
  INSTPM: 0x00000000
  FADDR: 0x0000fffe fffd7d00
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x000000006d572000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  ring->head: 0x00001d90
  ring->tail: 0x00001e90
  hangcheck timestamp: 0ms (4295824000; epoch)
  engine reset count: 0
  ELSP[0]:  pid 4420, seqno       52:00000288+, prio 4096, emitted -120ms, start 0615c000, head 00001de0, tail 00001e90
  ELSP[1]:  pid 2184, seqno       10:00004126+, prio 4096, emitted -112ms, start 00005000, head 00001e10, tail 00001eb8
  Active context: alacritty[4420] hw_id 33, prio 0, guilty 0 active 0
rcs0 (submitted by alacritty [4420]) --- gtt_offset = 0x0000fffe fffd7000

HEAD != TAIL != ACTHD. Based on this, can we conclude that this is NOTOURBUG?
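
The register comparison above can be sketched in Python as follows. This is only an illustration: the `hex_fields` and `parse_state` helpers are hypothetical, the four input lines are copied from the dump above, and the result is a reading of the numbers, not a triage verdict.

```python
import re

# Four lines copied from the rcs0 dump in this comment.
ERROR_STATE = """\
  HEAD:  0x00201e38 [0x00001de0]
  TAIL:  0x00001e90 [0x00001e40, 0x00001e90]
  ACTHD: 0x0000fffe fffd7b0c
  batch: [0x0000fffe_fffd7000, 0x0000fffe_fffeb000]
"""


def hex_fields(text):
    """Return all hex values in text; 'hi lo' and 'hi_lo' 64-bit pairs are joined."""
    joined = re.sub(r"(0x[0-9a-f]{8})[ _]([0-9a-f]{8})\b", r"\1\2", text)
    return [int(tok, 16) for tok in re.findall(r"0x[0-9a-f]+", joined)]


def parse_state(text):
    """Map each 'NAME: values' line to its list of integer values."""
    regs = {}
    for line in text.splitlines():
        key, sep, rest = line.strip().partition(":")
        if sep:
            regs[key] = hex_fields(rest)
    return regs


regs = parse_state(ERROR_STATE)
head, tail = regs["HEAD"][0], regs["TAIL"][0]
acthd = regs["ACTHD"][0]
batch_start, batch_end = regs["batch"]

print(f"HEAD != TAIL: {head != tail}")
print(f"ACTHD inside batch: {batch_start <= acthd < batch_end}")
print(f"ACTHD offset into batch: {acthd - batch_start:#x}")
```

With the values from this dump, ACTHD falls inside the batch range, 0xb0c bytes past its start, so the engine was still executing the batch submitted by alacritty when the hang was detected.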

Comment 3 Chris Wilson 2019-11-20 10:30:41 UTC

(In reply to Lakshmi from comment #2)
> HEAD != TAIL != ACTHD. Based on this, can we conclude that this is NOTOURBUG?

The DMAR [iommu] fault makes it very much our problem.

Comment 4 ryan 2019-11-21 11:43:10 UTC

Another very similar one. This seems to happen semi-regularly and has done so with the last few 5.4-rc kernels. I've found it always coincides with a 5-10 second freeze during which music keeps playing but input and the display become unresponsive (which makes sense given the GPU hang, I guess).

Another log, from the latest occurrence, for posterity:

[ 6307.875466] DMAR: DRHD: handling fault status reg 2
[ 6307.875477] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr fffffffefffbf000 [fault reason 07] Next page table ptr is invalid
[ 6315.062880] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x85dffffb, in alacritty [14310], hang on rcs0
[ 6315.062884] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 6315.062885] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 6315.062886] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 6315.062887] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 6315.062889] GPU crash dump saved to /sys/class/drm/card0/error
[ 6315.063909] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
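
As a rough cross-check of the freeze duration, the gap between each DMAR fault and the following GPU HANG message can be computed from the dmesg timestamps. The Python sketch below uses the lines from both logs in this report; the `fault_to_hang_gaps` helper is an illustrative assumption, not an existing tool.

```python
import re

# dmesg timestamp prefix, e.g. "[ 1156.640677]"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp(line):
    """Return the dmesg timestamp of a line in seconds."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError(f"no dmesg timestamp in: {line!r}")
    return float(m.group(1))


def fault_to_hang_gaps(lines):
    """Pair each DMAR DMA fault with the next GPU HANG line; return gaps in seconds."""
    gaps = []
    last_fault = None
    for line in lines:
        if "DMAR: [DMA" in line:
            last_fault = timestamp(line)
        elif "GPU HANG" in line and last_fault is not None:
            gaps.append(round(timestamp(line) - last_fault, 3))
            last_fault = None
    return gaps


# Fault and hang lines copied from the two logs in this report.
LOG = [
    "[ 1156.640677] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff "
    "fault addr fffffffeffef6000 [fault reason 07] Next page table ptr is invalid",
    "[ 1162.782665] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x85dffffb, "
    "in alacritty [4420], hang on rcs0",
    "[ 6307.875477] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff "
    "fault addr fffffffefffbf000 [fault reason 07] Next page table ptr is invalid",
    "[ 6315.062880] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x85dffffb, "
    "in alacritty [14310], hang on rcs0",
]

gaps = fault_to_hang_gaps(LOG)
print(gaps)
```

The two gaps come out at roughly 6.1 s and 7.2 s, which is in line with the 5-10 second freeze described in comment 4.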

Comment 5 ryan 2019-11-21 11:44:08 UTC

Created attachment 146004
Another GPU hang crash output

Comment 6 Martin Peres 2019-11-29 19:50:37 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug on our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/622.