Bug 105304

Summary: [BAT] [BSW only] igt@gem_sync@basic-each - GPU hung -> wedged -> WARN_ON(intel_gpu_reset(i915, (~0)))
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: BSW/CHT i915 features: GPU hang

Description Marta Löfstedt 2018-03-01 06:59:54 UTC
Starting with GPU hang:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@gem_sync@basic-each.html

(gem_sync:3220) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:481:
(gem_sync:3220) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-each failed.

Some skipped tests. 

Then first test getting that the machine is wedged:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@kms_pipe_crc_basic@hang-read-crc-pipe-c.html

(kms_pipe_crc_basic:3502) igt-gt-CRITICAL: Test assertion failure function igt_force_gpu_reset, file igt_gt.c:407:
(kms_pipe_crc_basic:3502) igt-gt-CRITICAL: Failed assertion: !wedged
Subtest hang-read-crc-pipe-C failed.

Then:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html

[  623.996507] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
[  623.996779] ------------[ cut here ]------------
[  623.996802] WARN_ON(intel_gpu_reset(i915, (~0)))
[  623.997265] WARNING: CPU: 0 PID: 3558 at drivers/gpu/drm/i915/i915_gem.c:4885 i915_gem_sanitize+0x4d/0x80 [i915]
[  623.997287] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 intel_powerclamp snd_hda_intel coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm r8169 mii lpc_ich prime_numbers pinctrl_cherryview
[  623.997695] CPU: 0 PID: 3558 Comm: kworker/u4:6 Tainted: G     U  W        4.16.0-rc3-CI-CI_DRM_3852+ #1
[  623.997716] Hardware name:  /NUC5CPYB, BIOS PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016
[  623.997753] Workqueue: events_unbound async_run_entry_fn
[  623.998098] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915]
[  623.998119] RSP: 0018:ffffc90000cafcc0 EFLAGS: 00010282
[  623.998158] RAX: 0000000000000024 RBX: ffff880168b40000 RCX: 0000000000000001
[  623.998178] RDX: 0000000080000001 RSI: ffffffff8210fbb9 RDI: 00000000ffffffff
[  623.998199] RBP: ffff880168b40070 R08: ffff880168b6d5c0 R09: 00000000bfaf67a8
[  623.998218] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880168b483d8
[  623.998237] R13: ffffffff814d2580 R14: 0000000000000000 R15: ffffffff820eae31
[  623.998260] FS:  0000000000000000(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
[  623.998280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  623.998301] CR2: 000055bab9bb6d80 CR3: 0000000005210000 CR4: 00000000001006f0
[  623.998320] Call Trace:
[  623.998675]  i915_gem_suspend+0xe4/0x130 [i915]
[  623.998987]  i915_drm_suspend+0x5f/0x160 [i915]
[  623.999038]  ? pci_pm_freeze+0xd0/0xd0
[  623.999063]  pci_pm_suspend+0x74/0x140
[  623.999112]  dpm_run_callback+0x5f/0x310
[  623.999167]  __device_suspend+0xfa/0x5b0
[  623.999215]  ? dpm_watchdog_set+0x60/0x60
[  623.999287]  async_suspend+0x15/0x90
[  623.999323]  async_run_entry_fn+0x2e/0x160
[  623.999371]  process_one_work+0x215/0x620
[  623.999456]  worker_thread+0x48/0x3a0
[  623.999524]  kthread+0xfb/0x130
[  623.999551]  ? process_one_work+0x620/0x620
[  623.999576]  ? _kthread_create_on_node+0x30/0x30
[  623.999626]  ret_from_fork+0x3a/0x50
[  623.999739] Code: 92 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 ba 90 03 00 85 c0 74 e0 48 c7 c6 f8 09 23 a0 48 c7 c7 60 99 21 a0 e8 c3 10 f6 e0 <0f> 0b eb c9 48 8d 6f 70 31 f6 48 89 ef e8 d1 53 7b e1 48 89 df 
[  624.001136] WARNING: CPU: 0 PID: 3558 at drivers/gpu/drm/i915/i915_gem.c:4885 i915_gem_sanitize+0x4d/0x80 [i915]
[  624.001156] ---[ end trace de4e2307708a4c98 ]---
[  624.045104] [drm:vlv_suspend_complete [i915]] *ERROR* timeout waiting for GT wells to go off
[  624.193314] IRQ 115: no longer affine to CPU1
[  624.193360] IRQ 117: no longer affine to CPU1
[  624.193400] IRQ 119: no longer affine to CPU1
[  624.193434] IRQ 120: no longer affine to CPU1
[  624.206873]  cache: parent cpu1 should not be sleeping
[  625.409208] sd 0:0:0:0: [sda] Starting disk
[  625.651818] done.
Comment 1 Marta Löfstedt 2018-03-29 08:28:27 UTC
Last seen: CI_DRM_3852: 2018-02-28 / 224 runs ago
Comment 2 Martin Peres 2018-04-30 07:00:46 UTC
Happened again:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@gem_sync@basic-each.html

(gem_sync:3107) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_sync:3107) igt_aux-CRITICAL: Failed assertion: !"GPU hung"


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@kms_pipe_crc_basic@hang-read-crc-pipe-c.html

(kms_pipe_crc_basic:3373) igt_gt-CRITICAL: Test assertion failure function igt_force_gpu_reset, file ../lib/igt_gt.c:415:
(kms_pipe_crc_basic:3373) igt_gt-CRITICAL: Failed assertion: !wedged
Subtest hang-read-crc-pipe-C failed.


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html

[  648.484962] ------------[ cut here ]------------
[  648.484973] WARN_ON(intel_gpu_reset(i915, (~0)))
[  648.485230] WARNING: CPU: 1 PID: 2921 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  648.485242] Modules linked in: vgem btusb btrtl btbcm btintel bluetooth ecdh_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel r8169 mii lpc_ich snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_cherryview prime_numbers
[  648.485501] CPU: 1 PID: 2921 Comm: kworker/u4:31 Tainted: G     U  W         4.17.0-rc2-CI-CI_DRM_4108+ #1
[  648.485512] Hardware name:  /NUC5CPYB, BIOS PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016
[  648.485528] Workqueue: events_unbound async_run_entry_fn
[  648.485716] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915]
[  648.485728] RSP: 0018:ffffc90000aebcc8 EFLAGS: 00010286
[  648.485749] RAX: 0000000000000000 RBX: ffff88016c890000 RCX: 0000000000000001
[  648.485760] RDX: 0000000080000001 RSI: ffffffff82121491 RDI: 00000000ffffffff
[  648.485770] RBP: ffff88016c890070 R08: 0000000075a89f58 R09: 0000000000000000
[  648.485781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88016c8989b8
[  648.485791] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  648.485803] FS:  0000000000000000(0000) GS:ffff88017fd00000(0000) knlGS:0000000000000000
[  648.485814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  648.485826] CR2: 00007fec90dd73d8 CR3: 0000000005210000 CR4: 00000000001006e0
[  648.485836] Call Trace:
[  648.486029]  i915_gem_suspend+0xec/0x140 [i915]
[  648.486203]  i915_drm_suspend+0x5f/0x160 [i915]
[  648.486231]  pci_pm_suspend+0x7c/0x130
[  648.486252]  ? pci_pm_freeze+0xb0/0xb0
[  648.486267]  dpm_run_callback+0x5d/0x2f0
[  648.486295]  __device_suspend+0x11f/0x600
[  648.486322]  ? dpm_watchdog_set+0x60/0x60
[  648.486364]  async_suspend+0x15/0x90
[  648.486382]  async_run_entry_fn+0x34/0x160
[  648.486408]  process_one_work+0x229/0x6a0
[  648.486453]  worker_thread+0x35/0x380
[  648.486482]  ? process_one_work+0x6a0/0x6a0
[  648.486495]  kthread+0x119/0x130
[  648.486511]  ? _kthread_create_on_node+0x60/0x60
[  648.486537]  ret_from_fork+0x3a/0x50
[  648.486597] Code: dc 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 0a da 03 00 85 c0 74 e0 48 c7 c6 00 f8 20 a0 48 c7 c7 e2 76 1f a0 e8 a3 71 f9 e0 <0f> 0b eb c9 48 8d 6f 70 31 f6 48 89 ef e8 51 d1 83 e1 48 89 df 
[  648.487196] irq event stamp: 1160
[  648.487212] hardirqs last  enabled at (1159): [<ffffffff810f8827>] vprintk_emit+0x4b7/0x4d0
[  648.487226] hardirqs last disabled at (1160): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[  648.487239] softirqs last  enabled at (1132): [<ffffffff81c0032b>] __do_softirq+0x32b/0x4e1
[  648.487254] softirqs last disabled at (1111): [<ffffffff8108b884>] irq_exit+0xa4/0xb0
[  648.487434] WARNING: CPU: 1 PID: 2921 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915]
[  648.487445] ---[ end trace 46a07151b792d058 ]---
[  649.548697] [drm:vlv_suspend_complete [i915]] *ERROR* timeout waiting for GT wells to go off
Comment 3 Chris Wilson 2018-04-30 13:08:15 UTC
Come back in 6 months to see if this is gone for good!

commit 5692251c254a3d561316c4e8e10c77e470b60658
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Apr 28 12:15:32 2018 +0100

    drm/i915/lrc: Scrub the GPU state of the guilty hanging request
    
    Previously, we just reset the ring register in the context image such
    that we could skip over the broken batch and emit the closing
    breadcrumb. However, on resume the context image and GPU state would be
    reloaded, which may have been left in an inconsistent state by the
    reset. The presumption was that at worst it would just cause another
    reset and skip again until it recovered, however it seems just as likely
    to cause an unrecoverable hang. Instead of risking loading an incomplete
    context image, restore it back to the default state.
    
    v2: Fix up off-by-one from including the ppHSWP in with the register
    state.
    v3: Use a ring local to compact a few lines.
    v4: Beware setting the ring local before checking for a NULL request.
Comment 4 Jani Saarinen 2018-06-19 14:25:58 UTC
Closing, thanks.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.