103076 – [v4.13 ARCH] GPU HANG: DMAR: DRHD: handling fault status reg 3 (arch reverted the use of intel_iommu=igfx_off)

Status:	CLOSED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (12):	102870 102912 103034 103042 103068 103082 103106 103121 103137 103139 103204 103230
Depends on:
Blocks:

Reported:	2017-10-03 13:04 UTC by Eric Blau
Modified:	2019-08-21 09:20 UTC
CC List:	22 users

See Also:
i915 platform:	BDW, SKL
i915 features:	GPU hang

Description Eric Blau 2017-10-03 13:04:18 UTC

System Architecture: x86_64
Kernel Version:      4.13.3-1-ARCH
Linux Distribution:  Arch Linux
Machine:             MacBook Pro 12,1
Display Connector:   Thunderbolt to DisplayPort (2 external monitors both connected via Thunderbolt, laptop display disabled)

I hit the following GPU hang when logging in to X and starting my normal set of running applications. I never make it more than 5 minutes before the entire X session hangs and requires a hard reboot. I have not been able to capture the error file for the hang for this reason.

The problem occurs reliably and makes Linux 4.13.3 unusable for me. Reverting to 4.12.13 makes the problem go away.
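
Since the session locks up before the error file can be copied by hand, one option is to grab it automatically as soon as it appears. The following is only a minimal sketch: it assumes the dump lives at /sys/class/drm/card0/error (the path the kernel log in this report prints) and that an empty dump reads something like "no error state collected". The function name and the destination path are hypothetical.

```python
from pathlib import Path

# Path printed by the i915 driver in the kernel log; the card index may differ.
ERROR_STATE = Path("/sys/class/drm/card0/error")


def save_error_state(src: Path, dst: Path) -> bool:
    """Copy the GPU error state from src to dst if one has been recorded.

    Returns True when a dump was written, False when there was nothing to save.
    """
    if not src.exists():
        return False
    data = src.read_text(errors="replace")
    # Assumption: an empty state reads like "no error state collected".
    if not data.strip() or "no error state collected" in data[:256].lower():
        return False
    dst.write_text(data)
    return True
```

Called in a loop with a short sleep (for example `save_error_state(ERROR_STATE, Path("/var/tmp/i915-error.txt"))`), this could catch the dump before the machine becomes unresponsive. Reading the file typically requires root.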

Oct 03 08:34:49 eric-macbookpro kernel: DMAR: DRHD: handling fault status reg 3
Oct 03 08:34:49 eric-macbookpro kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 2e4a000 [fault reason 23] Unknown
Oct 03 08:35:00 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[2616]/0:245b timed out
Oct 03 08:35:00 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[2616]/0:245a timed out
Oct 03 08:35:00 eric-macbookpro kernel: [drm] GPU HANG: ecode 8:0:0x85dffffb, in chromium [6927], reason: Hang on rcs0, action: reset
Oct 03 08:35:00 eric-macbookpro kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Oct 03 08:35:00 eric-macbookpro kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Oct 03 08:35:00 eric-macbookpro kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Oct 03 08:35:00 eric-macbookpro kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Oct 03 08:35:00 eric-macbookpro kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Oct 03 08:35:00 eric-macbookpro kernel: drm/i915: Resetting chip after gpu hang
Oct 03 08:35:10 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:35:11 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[2616]/0:2469 timed out
Oct 03 08:35:11 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[2616]/0:2468 timed out
Oct 03 08:35:14 eric-macbookpro kernel: drm/i915: Resetting chip after gpu hang
Oct 03 08:35:20 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:35:21 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:35:30 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:35:31 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:35:31 eric-macbookpro kernel: general protection fault: 0000 [#1] PREEMPT SMP
Oct 03 08:35:31 eric-macbookpro kernel: Modules linked in: fuse cmac bnep snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat fat brcmfmac brcmutil cfg80211 sch_fq_codel iTCO_wdt iTCO_vendor_support mmc_core snd_hda_codec_cirrus snd_hda_codec_generic thunderbolt s
Oct 03 08:35:31 eric-macbookpro kernel:  button facetimehd(O) videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media ip_tables x_tables zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) algif_skcipher af_alg hid_apple
Oct 03 08:35:31 eric-macbookpro kernel: CPU: 0 PID: 157 Comm: kworker/u8:4 Tainted: P           O    4.13.3-1-ARCH #1
Oct 03 08:35:31 eric-macbookpro kernel: Hardware name: Apple Inc. MacBookPro12,1/Mac-E43C1C25D4880AD6, BIOS MBP121.88Z.0167.B33.1706181928 06/18/2017
Oct 03 08:35:31 eric-macbookpro kernel: Workqueue: events_unbound intel_atomic_commit_work [i915]
Oct 03 08:35:31 eric-macbookpro kernel: task: ffff893be4ee2d00 task.stack: ffffb4eb01230000
Oct 03 08:35:31 eric-macbookpro kernel: RIP: 0010:__mutex_lock.isra.2+0x33d/0x520
Oct 03 08:35:31 eric-macbookpro kernel: RSP: 0000:ffffb4eb01233bc0 EFLAGS: 00010206
Oct 03 08:35:31 eric-macbookpro kernel: RAX: 260f120f3af4fe00 RBX: ffff893bb31da000 RCX: 0000000000000000
Oct 03 08:35:31 eric-macbookpro kernel: RDX: 260f120f3af4fe07 RSI: ffff893be4ee2d00 RDI: ffff893bb31e09a0
Oct 03 08:35:31 eric-macbookpro kernel: RBP: ffffb4eb01233c60 R08: ffff893b04683200 R09: 0000000000000004
Oct 03 08:35:31 eric-macbookpro kernel: R10: ffffb4eb01233c80 R11: ffffffff8eca248d R12: ffff893bb31d9800
Oct 03 08:35:31 eric-macbookpro kernel: R13: 0000000000000002 R14: ffff893bb31db800 R15: ffff893bb31e09a0
Oct 03 08:35:31 eric-macbookpro kernel: FS:  0000000000000000(0000) GS:ffff893beec00000(0000) knlGS:0000000000000000
Oct 03 08:35:31 eric-macbookpro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 03 08:35:31 eric-macbookpro kernel: CR2: 000006a1a808f000 CR3: 00000001f436d000 CR4: 00000000003406f0
Oct 03 08:35:31 eric-macbookpro kernel: Call Trace:
Oct 03 08:35:31 eric-macbookpro kernel:  ? vprintk_emit+0x28e/0x300
Oct 03 08:35:31 eric-macbookpro kernel:  __mutex_lock_slowpath+0x13/0x20
Oct 03 08:35:31 eric-macbookpro kernel:  ? __mutex_lock_slowpath+0x13/0x20
Oct 03 08:35:31 eric-macbookpro kernel:  mutex_lock+0x25/0x30
Oct 03 08:35:31 eric-macbookpro kernel:  ilk_initial_watermarks+0x28/0x60 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  intel_pre_plane_update+0xa8/0x130 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  intel_update_crtc+0xc1/0xe0 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  intel_update_crtcs+0x5b/0x80 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  intel_atomic_commit_tail+0x24b/0xf80 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  ? ttwu_do_wakeup+0x1e/0x160
Oct 03 08:35:31 eric-macbookpro kernel:  ? try_to_wake_up+0x59/0x450
Oct 03 08:35:31 eric-macbookpro kernel:  intel_atomic_commit_work+0x12/0x20 [i915]
Oct 03 08:35:31 eric-macbookpro kernel:  process_one_work+0x1de/0x430
Oct 03 08:35:31 eric-macbookpro kernel:  worker_thread+0x47/0x3f0
Oct 03 08:35:31 eric-macbookpro kernel:  kthread+0x125/0x140
Oct 03 08:35:31 eric-macbookpro kernel:  ? process_one_work+0x430/0x430
Oct 03 08:35:31 eric-macbookpro kernel:  ? kthread_create_on_node+0x70/0x70
Oct 03 08:35:31 eric-macbookpro kernel:  ret_from_fork+0x25/0x30
Oct 03 08:35:31 eric-macbookpro kernel: Code: 48 89 c2 e9 3b ff ff ff 48 89 d1 48 83 e1 fd 48 89 d0 48 09 f1 f0 49 0f b1 0f 48 39 c2 0f 84 1d fe ff ff 48 89 c2 e9 8e fe ff ff <8b> 50 60 85 d2 74 12 8b 78 64 48 31 c0 0f 1f 40 00 84 c0 0f 84 
Oct 03 08:35:31 eric-macbookpro kernel: RIP: __mutex_lock.isra.2+0x33d/0x520 RSP: ffffb4eb01233bc0
Oct 03 08:35:31 eric-macbookpro kernel: ---[ end trace e67b8bbb2fa05ad5 ]---
Oct 03 08:35:31 eric-macbookpro kernel: note: kworker/u8:4[157] exited with preempt_count 1
Oct 03 08:35:46 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:35:56 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:36:06 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:36:21 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:36:31 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:36:41 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:36:56 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:37:06 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:37:16 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:37:31 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:37:41 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:37:51 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:38:06 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:38:16 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:38:26 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:38:37 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:38:47 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:38:57 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:38:57 eric-macbookpro systemd[1]: Started Getty on tty2.
Oct 03 08:39:07 eric-macbookpro acpid[2423]: client 2616[0:0] has disconnected
Oct 03 08:40:29 eric-macbookpro acpid[2423]: client connected from 2616[0:0]
Oct 03 08:40:29 eric-macbookpro acpid[2423]: 1 client rule loaded
Oct 03 08:40:40 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:40:50 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:41:00 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
Oct 03 08:41:10 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:41:20 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] hw_done timed out
Oct 03 08:41:30 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:46:pipe C] flip_done timed out
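
For anyone comparing their own journal against the log above, the DMAR fault line can be picked apart mechanically. This is a rough sketch that assumes the message layout shown in this report (hexadecimal address, decimal fault reason); the `DmarFault` type and `parse_dmar_fault` name are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class DmarFault(NamedTuple):
    access: str   # "DMA Read" or "DMA Write"
    device: str   # PCI address as bus:device.function
    addr: int     # faulting address
    reason: int   # fault reason code


# Layout assumed from the log line in this report.
DMAR_RE = re.compile(
    r"DMAR: \[(?P<access>DMA (?:Read|Write))\] Request device "
    r"\[(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\]"
)


def parse_dmar_fault(line: str) -> Optional[DmarFault]:
    """Return the fault details from one log line, or None if it is not a fault."""
    m = DMAR_RE.search(line)
    if m is None:
        return None
    return DmarFault(m["access"], m["device"], int(m["addr"], 16), int(m["reason"]))
```

A fault whose device is 00:02.0, as here, points at the integrated GPU rather than at some other DMA-capable device.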

Comment 1 Eric Blau 2017-10-03 13:08:07 UTC

It looks like this may be similar to bug 102870.

Comment 2 Eric Blau 2017-10-03 13:09:10 UTC

Also possibly similar to bug 103068.

Comment 3 Chris Wilson 2017-10-03 13:15:59 UTC

*** Bug 103042 has been marked as a duplicate of this bug. ***

Comment 4 Chris Wilson 2017-10-03 13:16:05 UTC

*** Bug 103034 has been marked as a duplicate of this bug. ***

Comment 5 Chris Wilson 2017-10-03 13:16:41 UTC

DMAR and death is nothing new, see bug 89360. Standard practice is to disable iommu, with intel_iommu=igfx_off.

But the question here is what happened in v4.13 to make it happen for more people?
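
To check whether a running system already has the workaround applied, the kernel command line can be inspected. A minimal sketch, assuming intel_iommu= takes a comma-separated option list and that options from repeated occurrences accumulate; both helper names are hypothetical.

```python
from typing import List


def intel_iommu_options(cmdline: str) -> List[str]:
    """Collect every option passed via intel_iommu= on a kernel command line."""
    opts: List[str] = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_iommu" and sep:
            # Options are comma-separated, e.g. intel_iommu=on,igfx_off
            opts.extend(o for o in value.split(",") if o)
    return opts


def igfx_disabled(cmdline: str) -> bool:
    """True when igfx_off was requested for the integrated GPU."""
    return "igfx_off" in intel_iommu_options(cmdline)
```

On a live system the input would be the contents of /proc/cmdline, for example `igfx_disabled(open("/proc/cmdline").read())`.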

Comment 6 Chris Wilson 2017-10-03 13:17:04 UTC

*** Bug 102870 has been marked as a duplicate of this bug. ***

Comment 7 Chris Wilson 2017-10-03 13:17:14 UTC

*** Bug 103068 has been marked as a duplicate of this bug. ***

Comment 8 Eric Blau 2017-10-03 13:20:04 UTC

It looks like the Arch Linux kernel maintainer changed the default config option to enable IOMMU:

https://bugs.archlinux.org/task/55629

I will try with the kernel boot option you mention and report back. Thanks.
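
The config option is named later in this thread as CONFIG_INTEL_IOMMU_DEFAULT_ON. A small sketch for checking how a given kernel was built, assuming the usual Kconfig text format ("OPTION=value" or "# OPTION is not set"); the function name is hypothetical.

```python
import re


def config_state(config_text: str, option: str) -> str:
    """Return the value of a Kconfig option ('y', 'm', ...) or 'unset'."""
    # A disabled option appears as "# OPTION is not set" or is simply absent.
    m = re.search(rf"^{re.escape(option)}=(.*)$", config_text, re.MULTILINE)
    return m.group(1) if m else "unset"
```

If the running kernel exposes its configuration, the text can be read with `gzip.open("/proc/config.gz", "rt").read()`; otherwise the distribution's packaged config file would have to be used.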

Comment 9 Eric Blau 2017-10-03 13:55:19 UTC

intel_iommu=igfx_off solves the problem on 4.13.3 for me. Thanks for the suggestion. I get an almost immediate lockup in X without the option.

Comment 10 Chris Wilson 2017-10-03 22:20:08 UTC

*** Bug 103082 has been marked as a duplicate of this bug. ***

Comment 11 Chris Wilson 2017-10-05 09:26:46 UTC

*** Bug 103106 has been marked as a duplicate of this bug. ***

Comment 12 Chris Wilson 2017-10-05 21:27:08 UTC

*** Bug 102912 has been marked as a duplicate of this bug. ***

Comment 13 Chris Wilson 2017-10-06 08:35:26 UTC

*** Bug 103121 has been marked as a duplicate of this bug. ***

Comment 14 Chris Wilson 2017-10-07 16:06:49 UTC

*** Bug 103137 has been marked as a duplicate of this bug. ***

Comment 15 Chris Wilson 2017-10-07 20:30:17 UTC

*** Bug 103139 has been marked as a duplicate of this bug. ***

Comment 16 Ansgar Hegerfeld 2017-10-08 12:51:23 UTC

Maybe we should close this bug as a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=89360 or did I miss something? The workaround "intel_iommu=igfx_off" works for me using Arch Linux, too.

Comment 17 Chris Wilson 2017-10-11 08:17:03 UTC

*** Bug 103204 has been marked as a duplicate of this bug. ***

Comment 18 Chris Wilson 2017-10-12 07:51:41 UTC

*** Bug 103230 has been marked as a duplicate of this bug. ***

Comment 19 Carsten Mattner 2017-10-12 19:23:14 UTC

(In reply to Ansgar Hegerfeld from comment #16)
> Maybe we should close this bug as a duplicate of
> https://bugs.freedesktop.org/show_bug.cgi?id=89360 or did I miss something?
> The workaround "intel_iommu=igfx_off" works for me using Arch Linux, too.

I don't think the problem is fixed. I'm on Sandybridge and have tested
various kernels and configs.

4.9-LTS    OK

4.4-LTS    KINDA-OK because it has atomic modesetting errors
           that got introduced in 4.2 and haven't been fixed
           until 4.9-LTS (or earlier, don't have kernel in-between
           which is maintained). This would be a nice kernel because
           it will be supported until 2020 IIUC, but the new atomic
           modesetting code is buggier than in 4.9.

4.12       OK but EOL already

4.13       BAD. most problematic drm version:
On my Sandybridge machine I've disabled IOMMU in the BIOS and also added
intel_iommu=igfx_off, on top of the 4.13.5+ kernel having
CONFIG_INTEL_IOMMU_DEFAULT_ON not set anymore. Even though it's harder
to trigger now in 4.13, I can still provoke GPU errors not present in
either 4.12 or 4.9. I've been successfully using 4.9 for more than a
day with heavy GPU and CPU utilization and haven't hit the same errors
as in 4.13.

drm-tip from a week ago   No improvement over 4.13.4

I'm able to hit errors in 4.13 by running ffmpeg to encode a video,
utilizing vaapi for decoding the input stream, using the cpu cores
for encoding, and then starting a second VAAPI client or a browser
with a compositor process like Firefox or Chrome. If I just run
ffmpeg as the sole VAAPI client and no browser or mpv with vaapi
decoding, there are no hangs. The minute I fire up a video to
watch via vaapi and rendering with OpenGL or use Firefox/Chrome,
there's a GPU hang with reset.

Firefox:
[drm] GPU HANG: ecode 6:0:0x80202f7b, in Compositor [2620], reason: Hang on rcs0, action: reset
drm/i915: Resetting chip after gpu hang

Chrome:
drm/i915: Resetting chip after gpu hang
asynchronous wait on fence i915:[global]:a4255 timed out
drm/i915: Resetting chip after gpu hang
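
To compare hang signatures across kernels and processes, the GPU HANG lines can be parsed into fields. This is a sketch based only on the message layout seen in this thread. The field names `gen` and `engine` are my guesses at what the first two ecode components mean (the first is 8 on the Broadwell MacBook and 6 on Sandy Bridge here), not something confirmed in this bug.

```python
import re
from typing import Dict, Optional, Union

HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>[0-9a-f]+):(?P<hash>0x[0-9a-f]+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line: str) -> Optional[Dict[str, Union[int, str]]]:
    """Split a 'GPU HANG' log line into its fields, or return None."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    return {
        "gen": int(m["gen"]),          # assumed: GPU generation
        "engine": m["engine"],         # assumed: engine/ring identifier
        "hash": int(m["hash"], 16),    # third ecode component
        "comm": m["comm"],             # process blamed for the hang
        "pid": int(m["pid"]),
        "reason": m["reason"],
        "action": m["action"],
    }
```

Grouping parsed lines by `comm` and `hash` would show whether the same signature recurs, as the two Compositor hangs in this thread suggest.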


Summary: 4.9 stable for days, 4.4 not good, 4.12 good, 4.13 very bad.

Comment 20 Carsten Mattner 2017-10-12 19:24:33 UTC

While I appreciate new features like atomic modesetting or synchronization
fences, the fallout from all the changes has left the drm drivers in a state
of hit and miss. I mean, I would love to use the 4.9 drm drivers in a 4.13
kernel for stability reasons, but it's almost EOL.

Comment 21 Carsten Mattner 2017-10-14 21:17:10 UTC

Built a new drm-tip kernel today and 3d7ee91be487380ef6cad329fafbe424f6885372 is so far looking more promising than 4.13.6 did. But it's too early to declare success. drm-tip from a week ago wasn't this stable. Let's hope I can make it through the weekend without a GPU hang.

Comment 22 Carsten Mattner 2017-10-14 21:26:27 UTC

With that drm-tip kernel I can so far report the following, which isn't a GPU hang, but looks like a simple bug:

workqueue: PF_MEMALLOC task 41(khugepaged) is flushing !WQ_MEM_RECLAIM i915-userptr-release: (null)
WARNING: CPU: 3 PID: 41 at kernel/workqueue.c:2440 check_flush_dependency+0xe8/0xf0
[12787.495230] Call Trace:
[12787.495236]  flush_workqueue+0x110/0x3c0
[12787.495242]  ? finish_task_switch+0x70/0x1f0
[12787.495273]  ? i915_gem_userptr_mn_invalidate_range_start+0x13f/0x150 [i915]
[12787.495296]  i915_gem_userptr_mn_invalidate_range_start+0x13f/0x150 [i915]
[12787.495303]  __mmu_notifier_invalidate_range_start+0x4a/0x70
[12787.495307]  try_to_unmap_one+0x715/0x790
[12787.495311]  rmap_walk_file+0xe4/0x230
[12787.495314]  try_to_unmap+0x8e/0xe0
[12787.495317]  ? page_remove_rmap+0x260/0x260
[12787.495319]  ? page_not_mapped+0x10/0x10
[12787.495322]  ? page_get_anon_vma+0x90/0x90
[12787.495325]  migrate_pages+0x6d7/0x9a0
[12787.495329]  ? isolate_freepages_block+0x320/0x320
[12787.495332]  ? __ClearPageMovable+0x10/0x10
[12787.495335]  compact_zone+0x568/0x660
[12787.495337]  compact_zone_order+0x9b/0xc0
[12787.495341]  ? try_to_compact_pages+0xb2/0x220
[12787.495344]  try_to_compact_pages+0xb2/0x220
[12787.495348]  __alloc_pages_direct_compact+0x45/0xe0
[12787.495351]  __alloc_pages_slowpath+0xa66/0xc00
[12787.495354]  ? finish_task_switch+0x70/0x1f0
[12787.495358]  ? del_timer_sync+0x30/0x40
[12787.495361]  ? schedule_timeout+0x177/0x2b0
[12787.495364]  __alloc_pages_nodemask+0x1ab/0x1d0
[12787.495368]  ? wait_woken+0x80/0x80
[12787.495372]  khugepaged+0x296/0x1770
[12787.495375]  ? wait_woken+0x80/0x80
[12787.495379]  ? collapse_shmem.isra.39+0xa30/0xa30
[12787.495381]  kthread+0x10d/0x130
[12787.495384]  ? kthread_create_on_node+0x60/0x60
[12787.495387]  ret_from_fork+0x22/0x30
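
When reading a trace like this, the entries prefixed with "?" are, as far as I understand, addresses found on the stack that the unwinder could not confirm as part of the call chain, so it helps to filter them out. A minimal sketch, assuming the "symbol+0xoff/0xsize [module]" frame format shown above; the function name is hypothetical and only lines after "Call Trace:" should be passed in.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches "sym+0xoff/0xsize" with an optional leading "? " and trailing "[module]".
FRAME_RE = re.compile(
    r"(?P<guess>\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def reliable_frames(trace_lines: Iterable[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_lines:
        m = FRAME_RE.search(line)
        if m and not m["guess"]:
            frames.append((m["sym"], m["mod"]))
    return frames
```

Applied to the trace above, this leaves the chain from khugepaged through the compaction and migration code down to i915_gem_userptr_mn_invalidate_range_start and flush_workqueue.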

Comment 23 Carsten Mattner 2017-10-14 21:55:00 UTC

Still no hang, but another different error:

[14485.810561] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=733534 end=733535) time 208 us, min 763, max 767, scanline start 761, end 771

Happened when mpv was finishing playing a video.
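
The numbers in that message can be extracted for comparison between occurrences. This sketch only parses the fields. My reading, which is an assumption rather than something stated in this thread, is that start and end are frame counters and that the error is reported when they differ, i.e. the update did not complete within a single frame. All names below are hypothetical.

```python
import re
from typing import Dict, Optional, Union

UPDATE_RE = re.compile(
    r"Atomic update failure on pipe (?P<pipe>[A-Z]) "
    r"\(start=(?P<start>\d+) end=(?P<end>\d+)\) time (?P<us>\d+) us, "
    r"min (?P<min>\d+), max (?P<max>\d+), "
    r"scanline start (?P<sl_start>\d+), end (?P<sl_end>\d+)"
)


def parse_update_failure(line: str) -> Optional[Dict[str, Union[int, str]]]:
    """Parse an 'Atomic update failure' line into a dict, or return None."""
    m = UPDATE_RE.search(line)
    if m is None:
        return None
    fields: Dict[str, Union[int, str]] = {
        k: (v if k == "pipe" else int(v)) for k, v in m.groupdict().items()
    }
    # Assumed meaning: number of frames that elapsed during the update.
    fields["frames_crossed"] = int(fields["end"]) - int(fields["start"])
    return fields
```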

Comment 24 Carsten Mattner 2017-10-14 22:56:43 UTC

Finally caught one. Took longer than usual in 4.13+, but it's the same Firefox compositor process as before:

[18270.319058] [drm] GPU HANG: ecode 6:0:0x80203f7b, in Compositor [26396], reason: Hang on rcs0, action: reset

It always seems to take a while until the right conditions are met.

Back to 4.9 again until I build another drm-tip snapshot.

Comment 25 Carsten Mattner 2017-10-15 15:54:26 UTC

Been trying to provoke it under two different Wayland compositors for several hours (same workload as yesterday), and it seems much harder to trigger there (same drm-tip kernel).

Comment 26 Carsten Mattner 2017-10-15 15:55:49 UTC

Is this regression related to https://bugs.freedesktop.org/show_bug.cgi?id=101237?

Comment 27 Carsten Mattner 2017-10-15 16:34:05 UTC

(In reply to Carsten Mattner from comment #25)
> Been trying to provoke it under two different Wayland compositors for
> several hours (same workload as yesterday), and it seems much harder to
> trigger there (same drm-tip kernel).

After Wayland took too long to trigger, I exited and entered Xorg.
It didn't take long before I hit

workqueue: PF_MEMALLOC task 41(khugepaged) is flushing !WQ_MEM_RECLAIM i915-userptr-release: (NULL)

I suppose if I continue this session, it will repeat last night's events
and finally hang RCS0 and reset the GPU. This looks like a pattern to me.

Comment 28 Carsten Mattner 2017-10-15 19:05:35 UTC

(In reply to Carsten Mattner from comment #25)
> Been trying to provoke it under two different Wayland compositors for
> several hours (same workload as yesterday), and it seems much harder to
> trigger there (same drm-tip kernel).

Despite the familiar PF_MEMALLOC fault, after having run Wayland compositor
for 3+ hours, and having switched to Xorg after that, no reboot in between,
I still haven't hit the hang yet. Uneducated speculation would be that using
Wayland first after boot put the stack in a more forgiving state and increased
the time and operations needed for it to trigger. Wayland was run natively
via its drm backend.

I'll reboot soon for other reasons, but if Wayland manages to hide the drm
regression, I might have to use it as the main daily driver, although there's
no drop-in replacement for my usual X11 window manager environment (yet).

Comment 29 Carsten Mattner 2017-10-16 17:19:25 UTC

4.9 has been the most stable DRM stack, since 4.4.92 can also be made to
hang the GPU with the workload, as I noticed running it for hours today.
Haven't seen a single hang with 4.9 yet. Too bad 4.4 will be Extended-LTS
while 4.9 will be EOL soon. The concurrent use of VAAPI seems to trip
up things eventually with kernels <4.9 and >4.9 but not 4.9.

Comment 30 Carsten Mattner 2017-10-17 01:01:02 UTC

Testing drm-tip commit ba1af442e4884a1148422a7f92ae2f978cfb26a1

Comment 31 Carsten Mattner 2017-10-18 09:30:31 UTC

With drm-tip ba1af442e4884a1148422a7f92ae2f978cfb26a1 it took 8 hours before
the hangs happened. I managed to have ffmpeg and mpv be reported as the
processes that cause RCS0 hangs, both utilizing VAAPI, but once one hang
happened, anything (Firefox, Chrome) provokes the hang until restarting
the DRM stack (aka kernel restart).

4.9.56 is still hang-free with the same workload after more than 8 hours,
as before.

Comment 32 Ivan Linty 2017-10-20 13:44:09 UTC

Hi to all

I'm new to this mailing list; I've been a Linux user since 2000.

I can confirm this bug!

My system is openSUSE Tumbleweed 20171010.
Kernel is 4.13.5. My GPU driver is Intel i915. After the kernel update I get sudden random crashes. The system is completely stuck! No disk activity, no network activity, nothing reported in the systemd journal!
The crash happens after some time, 5 minutes or 30 minutes, at random, and the bug is very annoying!

A fix is to pass nomodeset.
Another fix is to revert to kernel 4.9 (I have installed 4.9.54-2-pf). With this kernel there are no more hangs, and Chrome with web Google Earth works like a charm!!! ;-)


intel_iommu=igfx_off does not work for me!

Kind Regards

Comment 33 Carsten Mattner 2017-10-21 15:27:30 UTC

So after running the same workload which happened to provoke the hangs with 4.4, 4.13 and 4.14-drm-tip, I've been trying to get it to hang with 4.9.

Multiple days and still same good result with the state of DRM in 4.9.56.

The quickest was with 4.13, not even needing an hour before the workload described above triggers the bugs.

4.14-drm-tip takes anywhere from 2 to 8 hours before the GPU hangs.

So I agree with Ivan, 4.9 has the most stable DRM right now and is sadly not the extended LTS release, so either 4.4 needs a backport of 4.9 DRM or 4.14 fixes for the regressions. Not sure what is more likely to happen.

Comment 34 Carsten Mattner 2017-10-21 19:58:39 UTC

Testing drm-tip 9dd506b9e3b79799503694e9c1bb5aba0d7d72eb

Comment 35 Carsten Mattner 2017-10-23 00:24:18 UTC

drm-tip 9dd506b9e3b79799503694e9c1bb5aba0d7d72eb same as before

Comment 36 Ivan Linty 2017-10-23 13:36:40 UTC

My system with:
kernel 4.13.5-1-default
and 
plymouth.enable=0 i915.semaphores=1 i915.enable_rc6=0 i915.enable_psr=0 intel_iommu=igfx_off

looks pretty stable... :-)
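
When testing a set of flags like this, it is worth confirming that the options actually reached the driver. A rough sketch, assuming the "module.param=value" command-line convention; both helper names are hypothetical and the comparison is a naive string match.

```python
from typing import Dict, Optional, Tuple


def module_params(cmdline: str, module: str) -> Dict[str, str]:
    """Extract 'module.param=value' settings for one module from a command line."""
    prefix = module + "."
    params: Dict[str, str] = {}
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        name, sep, value = token[len(prefix):].partition("=")
        if name:
            params[name] = value if sep else ""
    return params


def mismatches(
    wanted: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {param: (wanted, actual)} for every setting that differs."""
    return {k: (v, actual.get(k)) for k, v in wanted.items() if actual.get(k) != v}
```

The effective values can be read from /sys/module/i915/parameters/<name> and passed as `actual`. Some parameters may be reported in a different form than they were given on the command line, so a mismatch is a hint to look closer rather than proof that a flag was ignored.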

Comment 37 Carsten Mattner 2017-10-24 19:45:15 UTC

Ivan, the explicit, non-default options make no difference with 4.13.9 on Sandybridge. Still hangs after a few hours of VAAPI use.

Comment 38 Carsten Mattner 2017-10-24 19:47:07 UTC

Ivan, how extensively have you tested it? I can run CPU/GPU load for days with 4.9.57, but 4.13 and 4.14-drm-tip will hang eventually. And once it hangs for the first time, the GPU stack stays in a state where hangs recur easily until the kernel is restarted.

Comment 39 Ivan Linty 2017-10-29 07:37:59 UTC

Carsten, I'm not sure we are talking about the same issue at this point...

Comment 40 Carsten Mattner 2017-10-30 00:56:16 UTC

(In reply to Ivan Linty from comment #39)
> Carsten, not sure we are speaking about same issue at this point...

I'm talking about the GPU hangs. If I understand you correctly, you say that with those kernel flags you can't provoke hangs anymore.

My observation is that 4.13 and newer is still susceptible if you test long enough with a mix of CPU and GPU use. The original, immediate hang reported by Arch Linux users is fixed by not enabling DMAR by default, but I cannot make 4.9.59 GPU hang no matter how hard I try. With 4.13 (no DMAR) and newer it's easy, but takes a little time.

Comment 41 Carsten Mattner 2017-11-03 01:16:02 UTC

Another 4.13 regression: https://github.com/mpv-player/mpv/issues/5043

Comment 42 Carsten Mattner 2017-11-09 18:07:03 UTC

With 4.13.2 entering Xorg and leaving results in a failed atomic flip which then 2/3 of the time makes it impossible to restart the kernel cleanly.

This doesn't happen if a Wayland compositor is used and exited.

Comment 43 Carsten Mattner 2017-11-10 04:20:28 UTC

(In reply to Carsten Mattner from comment #42)
> With 4.13.2 entering Xorg and leaving results in a failed atomic flip which
> then 2/3 of the time makes it impossible to restart the kernel cleanly.
> 
> This doesn't happen if a Wayland compositor is used and exited.

It's this atomic error: "flip_done timed out" when you exit Xorg.

There have been other updates in Arch Linux and if I try hard I can reproduce it on 4.9.61 as well now.

Comment 44 Carsten Mattner 2017-11-11 02:47:23 UTC

(In reply to Carsten Mattner from comment #43)
> (In reply to Carsten Mattner from comment #42)
> > With 4.13.2 entering Xorg and leaving results in a failed atomic flip which
> > then 2/3 of the time makes it impossible to restart the kernel cleanly.
> > 
> > This doesn't happen if a Wayland compositor is used and exited.
> 
> It's this atomic error: "flip_done timed out" when you exit Xorg.
> 
> There have been other updates in Arch Linux and if I try hard I can
> reproduce it on 4.9.61 as well now.

Adding video=SVIDEO-1:d to the kernel cmdline seems to fix the flip_done hang.

Comment 45 Carsten Mattner 2017-11-12 16:18:32 UTC

(In reply to Carsten Mattner from comment #44)
> (In reply to Carsten Mattner from comment #43)
> > (In reply to Carsten Mattner from comment #42)
> > > With 4.13.2 entering Xorg and leaving results in a failed atomic flip which
> > > then 2/3 of the time makes it impossible to restart the kernel cleanly.
> > > 
> > > This doesn't happen if a Wayland compositor is used and exited.
> > 
> > It's this atomic error: "flip_done timed out" when you exit Xorg.
> > 
> > There have been other updates in Arch Linux and if I try hard I can
> > reproduce it on 4.9.61 as well now.
> 
> Adding video=SVIDEO-1:d to the kernel cmdline seems to fix the flip_done
> hang.

Ivan, coming back to your suggestion and explicitly enabling semaphores and disabling framebuffer compression, rc6 sleep mode and (I don't know what it is) psr, in addition to video=SVIDEO-1:d seems to be working better than the other tests so far on 4.13.12.

Still testing this:

video=SVIDEO-1:d plymouth.enable=0 i915.semaphores=1 i915.enable_rc6=0 i915.enable_psr=0 intel_iommu=igfx_off

I don't think plymouth.enable=0 is needed on Arch Linux since I think it's a Red Hat graphical boot system, isn't it? I mean it doesn't hurt and is ignored, but I had to ask.

Comment 46 Carsten Mattner 2017-11-13 00:21:43 UTC

(In reply to Carsten Mattner from comment #45)
> (In reply to Carsten Mattner from comment #44)
> > (In reply to Carsten Mattner from comment #43)
> > > (In reply to Carsten Mattner from comment #42)
> > > > With 4.13.2 entering Xorg and leaving results in a failed atomic flip which
> > > > then 2/3 of the time makes it impossible to restart the kernel cleanly.
> > > > 
> > > > This doesn't happen if a Wayland compositor is used and exited.
> > > 
> > > It's this atomic error: "flip_done timed out" when you exit Xorg.
> > > 
> > > There have been other updates in Arch Linux and if I try hard I can
> > > reproduce it on 4.9.61 as well now.
> > 
> > Adding video=SVIDEO-1:d to the kernel cmdline seems to fix the flip_done
> > hang.
> 
> Ivan, coming back to your suggestion and explicitly enabling semaphores and
> disabling framebuffer compression, rc6 sleep mode and (I don't know what it
> is) psr, in addition to video=SVIDEO-1:d seems to be working better than the
> other tests so far on 4.13.12.
> 
> Still testing this:
> 
> video=SVIDEO-1:d plymouth.enable=0 i915.semaphores=1 i915.enable_rc6=0
> i915.enable_psr=0 intel_iommu=igfx_off
> 
> I don't think plymouth.enable=0 is needed on Arch Linux since I think it's a
> Red Hat graphical boot system, isn't it? I mean it doesn't hurt and is
> ignored, but I had to ask.

It took almost 19 hours, but I was able to provoke the RCS0 hang.
The flags certainly seem to hide the regression(s) well enough
that one might get a work day's worth of use out of intel-drm,
if one follows a strict routine of rebooting once or twice a day.

Comment 47 Tuncer Ayaz 2017-11-13 21:53:06 UTC

Like many Linux/BSD users I have an x220 and I've been following this and similar bug reports closely. I don't have much to add but signed up to confirm that I run into the same problems. It's great that I found this ticket and the boot flags that sorta keep the bugs at bay.

Comment 48 Tuncer Ayaz 2017-11-13 21:53:56 UTC

Is https://bugs.freedesktop.org/show_bug.cgi?id=101237 a duplicate?

Comment 49 Eric Blau 2017-11-21 13:48:22 UTC

I'm seeing different symptoms with this error now. I assume this is still the same underlying issue.

I've been running with semaphores=1 lately, but it does not seem to help. Most of the time I get a hang with similar output to this when resuming from hibernate. Unfortunately I could not capture the error file this time because my laptop was completely unresponsive.

System Architecture: x86_64
Kernel Version:      4.13.12-1-ARCH
Linux Distribution:  Arch Linux
Machine:             MacBook Pro 12,1
Display Connector:   Thunderbolt to DisplayPort (2 external monitors both connected via Thunderbolt, laptop display disabled)

Nov 21 08:30:09 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[3192]/0:cc435 timed out
Nov 21 08:30:11 eric-macbookpro kernel: [drm] GPU HANG: ecode 8:0:0xda91d857, in slack [10507], reason: Hang on rcs0, action: reset
Nov 21 08:30:11 eric-macbookpro kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 21 08:30:11 eric-macbookpro kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 21 08:30:11 eric-macbookpro kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 21 08:30:11 eric-macbookpro kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 21 08:30:11 eric-macbookpro kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 21 08:30:11 eric-macbookpro kernel: drm/i915: Resetting chip after gpu hang
Nov 21 08:30:17 eric-macbookpro kernel: drm/i915: Resetting chip after gpu hang
Nov 21 08:30:17 eric-macbookpro kernel: [drm:i915_reset [i915]] *ERROR* GPU recovery failed
Nov 21 08:30:27 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:30:28 eric-macbookpro kernel: asynchronous wait on fence i915:Xorg[3192]/0:cc437 timed out
Nov 21 08:30:37 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:30:38 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:30:47 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] flip_done timed out
Nov 21 08:30:48 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] flip_done timed out
Nov 21 08:30:48 eric-macbookpro kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
Nov 21 08:30:48 eric-macbookpro kernel: IP: __mutex_lock.isra.2+0x203/0x520
Nov 21 08:30:48 eric-macbookpro kernel: PGD 0 
Nov 21 08:30:48 eric-macbookpro kernel: P4D 0 
Nov 21 08:30:48 eric-macbookpro kernel: 
Nov 21 08:30:48 eric-macbookpro kernel: Oops: 0002 [#1] PREEMPT SMP
Nov 21 08:30:48 eric-macbookpro kernel: Modules linked in: brcmfmac brcmutil cfg80211 mmc_core facetimehd(O) videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media tun asix usbnet mii libphy rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 
Nov 21 08:30:48 eric-macbookpro kernel:  intel_powerclamp coretemp kvm_intel kvm irqbypass i2c_algo_bit intel_cstate drm_kms_helper intel_rapl_perf drm snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm pcspkr snd_timer i2c_i801 intel_pch_thermal mei_m
Nov 21 08:30:48 eric-macbookpro kernel:  [last unloaded: brcmutil]
Nov 21 08:30:48 eric-macbookpro kernel: CPU: 0 PID: 17245 Comm: kworker/u8:73 Tainted: P     U     O    4.13.12-1-ARCH #1
Nov 21 08:30:48 eric-macbookpro kernel: Hardware name: Apple Inc. MacBookPro12,1/Mac-E43C1C25D4880AD6, BIOS MBP121.88Z.0167.B33.1706181928 06/18/2017
Nov 21 08:30:48 eric-macbookpro kernel: Workqueue: events_unbound intel_atomic_commit_work [i915]
Nov 21 08:30:48 eric-macbookpro kernel: task: ffff9a2a1eca6900 task.stack: ffffb7848560c000
Nov 21 08:30:48 eric-macbookpro kernel: RIP: 0010:__mutex_lock.isra.2+0x203/0x520
Nov 21 08:30:48 eric-macbookpro kernel: RSP: 0018:ffffb7848560fbd0 EFLAGS: 00010212
Nov 21 08:30:48 eric-macbookpro kernel: RAX: 0000000000000004 RBX: ffff9a2a1eca6900 RCX: 0000000000000002
Nov 21 08:30:48 eric-macbookpro kernel: RDX: ffff9a295a2e4198 RSI: ffff9a2a1eca6900 RDI: ffff9a295a2e41b0
Nov 21 08:30:48 eric-macbookpro kernel: RBP: ffffb7848560fc70 R08: 0000000000000022 R09: ffff9a2a223fe9c0
Nov 21 08:30:48 eric-macbookpro kernel: R10: 0000000000000210 R11: 0000000000000207 R12: ffffb7848560fc10
Nov 21 08:30:48 eric-macbookpro kernel: R13: 0000000000000002 R14: ffff9a295a2df000 R15: ffff9a295a2e41a0
Nov 21 08:30:48 eric-macbookpro kernel: FS:  0000000000000000(0000) GS:ffff9a2a2ec00000(0000) knlGS:0000000000000000
Nov 21 08:30:48 eric-macbookpro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 08:30:48 eric-macbookpro kernel: CR2: 0000000000000004 CR3: 0000000257a09000 CR4: 00000000003406f0
Nov 21 08:30:48 eric-macbookpro kernel: Call Trace:
Nov 21 08:30:48 eric-macbookpro kernel:  ? gen8_write32+0x104/0x260 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  __mutex_lock_slowpath+0x13/0x20
Nov 21 08:30:48 eric-macbookpro kernel:  ? __mutex_lock_slowpath+0x13/0x20
Nov 21 08:30:48 eric-macbookpro kernel:  mutex_lock+0x25/0x30
Nov 21 08:30:48 eric-macbookpro kernel:  ilk_initial_watermarks+0x28/0x120 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  intel_pre_plane_update+0xa8/0x130 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  intel_update_crtc+0xc1/0xe0 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  intel_update_crtcs+0x5b/0x80 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  intel_atomic_commit_tail+0x24b/0xf80 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  ? dequeue_task_fair+0x49f/0x640
Nov 21 08:30:48 eric-macbookpro kernel:  ? __switch_to+0x1fc/0x4d0
Nov 21 08:30:48 eric-macbookpro kernel:  ? finish_task_switch+0x75/0x200
Nov 21 08:30:48 eric-macbookpro kernel:  intel_atomic_commit_work+0x12/0x20 [i915]
Nov 21 08:30:48 eric-macbookpro kernel:  process_one_work+0x1de/0x430
Nov 21 08:30:48 eric-macbookpro kernel:  worker_thread+0x48/0x400
Nov 21 08:30:48 eric-macbookpro kernel:  kthread+0x125/0x140
Nov 21 08:30:48 eric-macbookpro kernel:  ? process_one_work+0x430/0x430
Nov 21 08:30:48 eric-macbookpro kernel:  ? kthread_create_on_node+0x70/0x70
Nov 21 08:30:48 eric-macbookpro kernel:  ret_from_fork+0x25/0x30
Nov 21 08:30:48 eric-macbookpro kernel: Code: 48 39 c6 0f 84 c1 02 00 00 49 8d 47 10 4c 8d 65 a0 48 89 c7 48 89 85 70 ff ff ff 49 8b 47 18 48 89 7d a0 4d 89 67 18 48 89 45 a8 <4c> 89 20 65 48 8b 04 25 00 d3 00 00 4d 39 67 10 48 89 45 b0 0f 
Nov 21 08:30:48 eric-macbookpro kernel: RIP: __mutex_lock.isra.2+0x203/0x520 RSP: ffffb7848560fbd0
Nov 21 08:30:48 eric-macbookpro kernel: CR2: 0000000000000004
Nov 21 08:30:48 eric-macbookpro kernel: ---[ end trace 7df4d0d92d1ba7c4 ]---
Nov 21 08:30:48 eric-macbookpro kernel: note: kworker/u8:73[17245] exited with preempt_count 2
Nov 21 08:30:57 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:31:07 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:31:17 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] flip_done timed out
Nov 21 08:31:27 eric-macbookpro kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:31:37 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] hw_done timed out
Nov 21 08:31:47 eric-macbookpro kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:32:pipe A] flip_done timed out
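
For anyone comparing their own logs against the output above, here is a minimal sketch that filters a saved kernel log for the kinds of messages quoted in this report. The function name `find_hang_lines` and the marker list are illustrative assumptions taken from the lines in this thread; this is not an i915 or kernel tool, and the list is not exhaustive.

```python
import re

# Markers copied from kernel messages quoted in this report.
# Illustrative only: other hang-related messages may exist.
HANG_MARKERS = re.compile(
    r"GPU HANG"
    r"|DMAR: "
    r"|GPU recovery failed"
    r"|(?:hw_done|flip_done) timed out"
    r"|asynchronous wait on fence"
)


def find_hang_lines(lines):
    """Return the log lines that contain one of the markers above."""
    return [line.rstrip("\n") for line in lines if HANG_MARKERS.search(line)]


# The first two sample lines are shortened from this report; the third is a
# made-up unrelated line added only to show that it gets filtered out.
sample = [
    "kernel: DMAR: DRHD: handling fault status reg 3",
    "kernel: [drm] GPU HANG: ecode 8:0:0xda91d857, in slack [10507], reason: Hang on rcs0, action: reset",
    "kernel: some unrelated message",
]

for hit in find_hang_lines(sample):
    print(hit)
```

The same function could be fed a file of saved kernel messages (for example, output of `journalctl -k` redirected to a file), which may make it easier to tell whether a given hang shows the DMAR fault or only the fence and flip timeouts.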

Comment 50 like.the.23 2017-11-30 09:11:39 UTC

Hi,

I found a way to reliably trigger the problem on my Arch Linux machine. When I test code inside a Vagrant box and play a YouTube video in Chrome, it hangs 99% of the time.

E.g., acceptance tests for Puppet code like https://github.com/puppetlabs/puppetlabs-apache

Comment 51 Jani Saarinen 2018-03-29 07:11:30 UTC

First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), could you do that as well to see whether the issue is still present there? If you can no longer see the issue there, please comment on the bug.

Comment 52 Jani Saarinen 2018-04-25 08:13:47 UTC

Just trying to understand whether this is still valid.
Closing; please re-open if the issue still exists.

