Starting with GPU hang: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@gem_sync@basic-each.html (gem_sync:3220) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:481: (gem_sync:3220) igt-aux-CRITICAL: Failed assertion: !"GPU hung" Subtest basic-each failed. Some skipped tests. Then first test getting that the machine is wedged: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@kms_pipe_crc_basic@hang-read-crc-pipe-c.html (kms_pipe_crc_basic:3502) igt-gt-CRITICAL: Test assertion failure function igt_force_gpu_reset, file igt_gt.c:407: (kms_pipe_crc_basic:3502) igt-gt-CRITICAL: Failed assertion: !wedged Subtest hang-read-crc-pipe-C failed. Then: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3852/fi-bsw-n3050/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html [ 623.996507] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout [ 623.996779] ------------[ cut here ]------------ [ 623.996802] WARN_ON(intel_gpu_reset(i915, (~0))) [ 623.997265] WARNING: CPU: 0 PID: 3558 at drivers/gpu/drm/i915/i915_gem.c:4885 i915_gem_sanitize+0x4d/0x80 [i915] [ 623.997287] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 intel_powerclamp snd_hda_intel coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm r8169 mii lpc_ich prime_numbers pinctrl_cherryview [ 623.997695] CPU: 0 PID: 3558 Comm: kworker/u4:6 Tainted: G U W 4.16.0-rc3-CI-CI_DRM_3852+ #1 [ 623.997716] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016 [ 623.997753] Workqueue: events_unbound async_run_entry_fn [ 623.998098] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915] [ 623.998119] RSP: 0018:ffffc90000cafcc0 EFLAGS: 00010282 [ 623.998158] RAX: 0000000000000024 RBX: ffff880168b40000 RCX: 0000000000000001 [ 623.998178] RDX: 0000000080000001 RSI: ffffffff8210fbb9 RDI: 00000000ffffffff [ 623.998199] RBP: ffff880168b40070 R08: ffff880168b6d5c0 R09: 00000000bfaf67a8 [ 623.998218] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880168b483d8 [ 623.998237] R13: ffffffff814d2580 R14: 0000000000000000 R15: ffffffff820eae31 [ 623.998260] FS: 0000000000000000(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000 [ 623.998280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 623.998301] CR2: 000055bab9bb6d80 CR3: 0000000005210000 CR4: 00000000001006f0 [ 623.998320] Call Trace: [ 623.998675] i915_gem_suspend+0xe4/0x130 [i915] [ 623.998987] i915_drm_suspend+0x5f/0x160 [i915] [ 623.999038] ? pci_pm_freeze+0xd0/0xd0 [ 623.999063] pci_pm_suspend+0x74/0x140 [ 623.999112] dpm_run_callback+0x5f/0x310 [ 623.999167] __device_suspend+0xfa/0x5b0 [ 623.999215] ? dpm_watchdog_set+0x60/0x60 [ 623.999287] async_suspend+0x15/0x90 [ 623.999323] async_run_entry_fn+0x2e/0x160 [ 623.999371] process_one_work+0x215/0x620 [ 623.999456] worker_thread+0x48/0x3a0 [ 623.999524] kthread+0xfb/0x130 [ 623.999551] ? process_one_work+0x620/0x620 [ 623.999576] ? _kthread_create_on_node+0x30/0x30 [ 623.999626] ret_from_fork+0x3a/0x50 [ 623.999739] Code: 92 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 ba 90 03 00 85 c0 74 e0 48 c7 c6 f8 09 23 a0 48 c7 c7 60 99 21 a0 e8 c3 10 f6 e0 <0f> 0b eb c9 48 8d 6f 70 31 f6 48 89 ef e8 d1 53 7b e1 48 89 df [ 624.001136] WARNING: CPU: 0 PID: 3558 at drivers/gpu/drm/i915/i915_gem.c:4885 i915_gem_sanitize+0x4d/0x80 [i915] [ 624.001156] ---[ end trace de4e2307708a4c98 ]--- [ 624.045104] [drm:vlv_suspend_complete [i915]] *ERROR* timeout waiting for GT wells to go off [ 624.193314] IRQ 115: no longer affine to CPU1 [ 624.193360] IRQ 117: no longer affine to CPU1 [ 624.193400] IRQ 119: no longer affine to CPU1 [ 624.193434] IRQ 120: no longer affine to CPU1 [ 624.206873] cache: parent cpu1 should not be sleeping [ 625.409208] sd 0:0:0:0: [sda] Starting disk [ 625.651818] done.
Last seen: CI_DRM_3852: 2018-02-28 / 224 runs ago
Happened again: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@gem_sync@basic-each.html (gem_sync:3107) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481: (gem_sync:3107) igt_aux-CRITICAL: Failed assertion: !"GPU hung" https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@kms_pipe_crc_basic@hang-read-crc-pipe-c.html (kms_pipe_crc_basic:3373) igt_gt-CRITICAL: Test assertion failure function igt_force_gpu_reset, file ../lib/igt_gt.c:415: (kms_pipe_crc_basic:3373) igt_gt-CRITICAL: Failed assertion: !wedged Subtest hang-read-crc-pipe-C failed. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html [ 648.484962] ------------[ cut here ]------------ [ 648.484973] WARN_ON(intel_gpu_reset(i915, (~0))) [ 648.485230] WARNING: CPU: 1 PID: 2921 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915] [ 648.485242] Modules linked in: vgem btusb btrtl btbcm btintel bluetooth ecdh_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel r8169 mii lpc_ich snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_cherryview prime_numbers [ 648.485501] CPU: 1 PID: 2921 Comm: kworker/u4:31 Tainted: G U W 4.17.0-rc2-CI-CI_DRM_4108+ #1 [ 648.485512] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016 [ 648.485528] Workqueue: events_unbound async_run_entry_fn [ 648.485716] RIP: 0010:i915_gem_sanitize+0x4d/0x80 [i915] [ 648.485728] RSP: 0018:ffffc90000aebcc8 EFLAGS: 00010286 [ 648.485749] RAX: 0000000000000000 RBX: ffff88016c890000 RCX: 0000000000000001 [ 648.485760] RDX: 0000000080000001 RSI: ffffffff82121491 RDI: 00000000ffffffff [ 648.485770] RBP: ffff88016c890070 R08: 0000000075a89f58 R09: 0000000000000000 [ 648.485781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88016c8989b8 [ 648.485791] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 648.485803] FS: 0000000000000000(0000) GS:ffff88017fd00000(0000) knlGS:0000000000000000 [ 648.485814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 648.485826] CR2: 00007fec90dd73d8 CR3: 0000000005210000 CR4: 00000000001006e0 [ 648.485836] Call Trace: [ 648.486029] i915_gem_suspend+0xec/0x140 [i915] [ 648.486203] i915_drm_suspend+0x5f/0x160 [i915] [ 648.486231] pci_pm_suspend+0x7c/0x130 [ 648.486252] ? pci_pm_freeze+0xb0/0xb0 [ 648.486267] dpm_run_callback+0x5d/0x2f0 [ 648.486295] __device_suspend+0x11f/0x600 [ 648.486322] ? dpm_watchdog_set+0x60/0x60 [ 648.486364] async_suspend+0x15/0x90 [ 648.486382] async_run_entry_fn+0x34/0x160 [ 648.486408] process_one_work+0x229/0x6a0 [ 648.486453] worker_thread+0x35/0x380 [ 648.486482] ? process_one_work+0x6a0/0x6a0 [ 648.486495] kthread+0x119/0x130 [ 648.486511] ? _kthread_create_on_node+0x60/0x60 [ 648.486537] ret_from_fork+0x3a/0x50 [ 648.486597] Code: dc 03 00 84 c0 74 f1 be ff ff ff ff 48 89 df e8 0a da 03 00 85 c0 74 e0 48 c7 c6 00 f8 20 a0 48 c7 c7 e2 76 1f a0 e8 a3 71 f9 e0 <0f> 0b eb c9 48 8d 6f 70 31 f6 48 89 ef e8 51 d1 83 e1 48 89 df [ 648.487196] irq event stamp: 1160 [ 648.487212] hardirqs last enabled at (1159): [<ffffffff810f8827>] vprintk_emit+0x4b7/0x4d0 [ 648.487226] hardirqs last disabled at (1160): [<ffffffff81a0111c>] error_entry+0x7c/0x100 [ 648.487239] softirqs last enabled at (1132): [<ffffffff81c0032b>] __do_softirq+0x32b/0x4e1 [ 648.487254] softirqs last disabled at (1111): [<ffffffff8108b884>] irq_exit+0xa4/0xb0 [ 648.487434] WARNING: CPU: 1 PID: 2921 at drivers/gpu/drm/i915/i915_gem.c:4978 i915_gem_sanitize+0x4d/0x80 [i915] [ 648.487445] ---[ end trace 46a07151b792d058 ]--- [ 649.548697] [drm:vlv_suspend_complete [i915]] *ERROR* timeout waiting for GT wells to go off
Come back in 6 months to see if this is gone for good! commit 5692251c254a3d561316c4e8e10c77e470b60658 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sat Apr 28 12:15:32 2018 +0100 drm/i915/lrc: Scrub the GPU state of the guilty hanging request Previously, we just reset the ring register in the context image such that we could skip over the broken batch and emit the closing breadcrumb. However, on resume the context image and GPU state would be reloaded, which may have been left in an inconsistent state by the reset. The presumption was that at worst it would just cause another reset and skip again until it recovered, however it seems just as likely to cause an unrecoverable hang. Instead of risking loading an incomplete context image, restore it back to the default state. v2: Fix up off-by-one from including the ppHSWP in with the register state. v3: Use a ring local to compact a few lines. v4: Beware setting the ring local before checking for a NULL request.
Closing, thanks.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.