Bug 103022

Summary: [BAT] igt@kms_* incomplete softdog
Product: DRI Reporter: Jani Saarinen <jani.saarinen>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: CFL i915 features: display/atomic

Description Jani Saarinen 2017-09-28 08:54:36 UTC
ON CI_DRM_3148 fi-cfl-s incompletes on igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3148/fi-cfl-s/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c.html

Panic exists
<14>[  542.308894] [IGT] kms_pipe_crc_basic: starting subtest nonblocking-crc-pipe-C
<14>[  542.309017] [IGT] kms_pipe_crc_basic: exiting, ret=77
<5>[  549.732721] i915 0000:00:02.0: Resetting bcs0 after gpu hang
<7>[  549.732986] [drm:gen8_init_common_ring [i915]] Execlists enabled for bcs0
<5>[  549.733048] i915 0000:00:02.0: Resetting vcs0 after gpu hang
<7>[  549.733194] [drm:gen8_init_common_ring [i915]] Execlists enabled for vcs0
<5>[  549.733243] i915 0000:00:02.0: Resetting vecs0 after gpu hang
<7>[  549.733468] [drm:gen8_init_common_ring [i915]] Execlists enabled for vecs0
<5>[  551.716895] i915 0000:00:02.0: Resetting rcs0 after gpu hang
<7>[  551.717186] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
<7>[  551.729497] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 16
<7>[  553.700605] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? yes, current seqno=2c32cc, last=2c32cc
<3>[  553.829554] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
<3>[  564.196642] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
<3>[  574.436369] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:46:pipe B] flip_done timed out
<2>[  575.414385] softdog: Initiating panic
<0>[  575.414410] Kernel panic - not syncing: Software Watchdog Timer expired
<4>[  575.414436] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G     U  W       4.14.0-rc2-CI-CI_DRM_3148+ #1
<4>[  575.414465] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWR1.R00.X095.A04.1707240838 07/24/2017
<4>[  575.414503] Call Trace:
<4>[  575.414515]  <IRQ>
<4>[  575.414535]  dump_stack+0x68/0x9f
<4>[  575.414552]  panic+0xd4/0x21d
<4>[  575.414570]  ? printk+0x43/0x4b
<4>[  575.414591]  softdog_fire+0x6b/0x70
<4>[  575.414607]  __hrtimer_run_queues+0xed/0x4d0
<4>[  575.414627]  ? softdog_ping+0x50/0x50
<4>[  575.414644]  hrtimer_interrupt+0xc1/0x220
<4>[  575.414664]  smp_apic_timer_interrupt+0x7d/0x2e0
<4>[  575.414683]  apic_timer_interrupt+0x9a/0xa0
<4>[  575.414698]  </IRQ>
<4>[  575.414711] RIP: 0010:cpuidle_enter_state+0x136/0x370
<4>[  575.414729] RSP: 0018:ffffc90000117e80 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff10
<4>[  575.414757] RAX: 0000000000000000 RBX: 000000001bc5f16e RCX: 0000000000000000
<4>[  575.414781] RDX: 00000085f95e6bec RSI: ffffffff81cb44fe RDI: ffffffff8175ebe2
<4>[  575.414803] RBP: ffffc90000117eb8 R08: 00000000001f3bc5 R09: 0000000000000018
<4>[  575.414826] R10: 00000000000718a0 R11: 00000000000b9768 R12: 0000000000000008
<4>[  575.414849] R13: 0000000000000008 R14: ffffe8ffffac4998 R15: 00000085dd987a7e
<4>[  575.414877]  ? cpuidle_enter_state+0x132/0x370
<4>[  575.414898]  cpuidle_enter+0x17/0x20
<4>[  575.414916]  call_cpuidle+0x23/0x40
<4>[  575.414932]  do_idle+0x192/0x1e0
<4>[  575.414951]  cpu_startup_entry+0x1d/0x20
<4>[  575.414968]  start_secondary+0x11c/0x140
<4>[  575.414988]  secondary_startup_64+0xa5/0xa5
<0>[  575.415093] Kernel Offset: disabled
Comment 1 Chris Wilson 2017-10-02 12:39:19 UTC
The error begun with

[  421.502977] [IGT] kms_pipe_crc_basic: starting subtest nonblocking-crc-pipe-B-frame-sequence

igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c is an innocent victim.
Comment 2 Marta Löfstedt 2017-10-03 10:30:51 UTC
I consider this bug to be about all softdog incompletes on CFL. I have to file the bug on something, so don't stare at the tests so far implicated. 

Reproduced also on igt@kms_flip@basic-flip-vs-dpm.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3165/fi-cfl-s/igt@kms_flip@basic-flip-vs-dpms.html
Comment 3 Marta Löfstedt 2017-10-03 10:33:28 UTC
This from dmesg appear frequently just before the softdog times out:

<3>[  354.277367] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:46:pipe B] flip_done timed out
<7>[  354.277608] [drm:verify_connector_state.isra.72 [i915]] [CONNECTOR:48:eDP-1]
<7>[  354.277718] [drm:intel_atomic_commit_tail [i915]] [CRTC:46:pipe B]
<7>[  354.277912] [drm:verify_single_dpll_state.isra.73 [i915]] DPLL 0
<14>[  354.289994] [IGT] kms_flip: exiting, ret=99
<3>[  364.517139] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:46:pipe B] flip_done timed out
<3>[  374.757352] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:46:pipe B] flip_done timed out
<6>[  386.728770] [drm] GPU HANG: ecode 9:1:0xfffffffe, reason: Hang on bcs0, vcs0, vecs0, action: reset
<5>[  386.729156] i915 0000:00:02.0: Resetting bcs0 after gpu hang
<7>[  386.729960] [drm:gen8_init_common_ring [i915]] Execlists enabled for bcs0
<5>[  386.730083] i915 0000:00:02.0: Resetting vcs0 after gpu hang
<7>[  386.730312] [drm:gen8_init_common_ring [i915]] Execlists enabled for vcs0
<5>[  386.730320] i915 0000:00:02.0: Resetting vecs0 after gpu hang
<7>[  386.730458] [drm:gen8_init_common_ring [i915]] Execlists enabled for vecs0
<5>[  390.693538] i915 0000:00:02.0: Resetting rcs0 after gpu hang
<7>[  390.693959] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
<7>[  390.694170] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 16
<3>[  392.806459] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
<7>[  392.808157] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_engine_disarm_breadcrumbs+0xb4/0xe0 [i915], irq posted? yes, current seqno=295740, last=295740
<7>[  392.833212] [drm:intel_atomic_check [i915]] New cdclk calculated to be logical 337500 kHz, actual 337500 kHz
<3>[  402.917171] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:46:pipe B] flip_done timed out
<2>[  412.592339] softdog: Initiating panic
Comment 4 Marta Löfstedt 2017-10-05 11:33:35 UTC
Also:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3171/fi-cfl-s/igt@drv_module_reload@basic-reload-inject.html
and:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3179/fi-cfl-s/igt@drv_module_reload@basic-no-display.html

Note: before softdog is triggered dmesg:



<4>[  536.788670] DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock))
<4>[  536.788683] ------------[ cut here ]------------
<4>[  536.788723] WARNING: CPU: 0 PID: 4959 at kernel/locking/mutex-debug.c:103 mutex_destroy+0x4e/0x60
<4>[  536.788748] Modules linked in: i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm vgem ax88179_178a usbnet x86_pkg_temp_thermal mii intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000e ptp pps_core prime_numbers i2c_hid [last unloaded: snd_hda_intel]
<4>[  536.788878] CPU: 0 PID: 4959 Comm: drv_module_relo Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3171+ #1
<4>[  536.788905] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWX1.R00.X104.A03.1709140524 09/14/2017
<4>[  536.788939] task: ffff8804146b5440 task.stack: ffffc9000054c000
<4>[  536.788959] RIP: 0010:mutex_destroy+0x4e/0x60
<4>[  536.788974] RSP: 0018:ffffc9000054fd58 EFLAGS: 00010292
<4>[  536.788992] RAX: 000000000000002a RBX: ffff88044d8be108 RCX: 0000000000000000
<4>[  536.789013] RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffffff810f0cf0
<4>[  536.789033] RBP: ffffc9000054fd60 R08: 0000000000000001 R09: 0000000000000001
<4>[  536.789054] R10: 000000000f21cb81 R11: 0000000000000000 R12: ffff8803f5d12e98
<4>[  536.789074] R13: ffffffffa02c1d20 R14: ffffffffa02c1d90 R15: ffff880459b29308
<4>[  536.789095] FS:  00007f3c5d91f8c0(0000) GS:ffff88045d200000(0000) knlGS:0000000000000000
<4>[  536.789119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  536.789137] CR2: 000055967e4cf090 CR3: 00000004579f7006 CR4: 00000000003606f0
<4>[  536.789157] Call Trace:
<4>[  536.789172]  drm_fb_helper_fini+0xd9/0x130
<4>[  536.789264]  intel_fbdev_destroy+0x12/0x60 [i915]
<4>[  536.789349]  intel_fbdev_fini+0x28/0x30 [i915]
<4>[  536.789432]  intel_modeset_cleanup+0x45/0xa0 [i915]
<4>[  536.789492]  i915_driver_unload+0x92/0x180 [i915]
<4>[  536.789554]  i915_pci_remove+0x19/0x30 [i915]
<4>[  536.789607]  pci_device_remove+0x39/0xb0
<4>[  536.789625]  device_release_driver_internal+0x15d/0x220
<4>[  536.789653]  driver_detach+0x40/0x80
<4>[  536.789673]  bus_remove_driver+0x58/0xd0
<4>[  536.789694]  driver_unregister+0x2c/0x40
<4>[  536.789715]  pci_unregister_driver+0x36/0xb0
<4>[  536.789799]  i915_exit+0x1a/0x8b [i915]
<4>[  536.789821]  SyS_delete_module+0x18c/0x1e0
<4>[  536.789842]  entry_SYSCALL_64_fastpath+0x1c/0xb1
<4>[  536.789858] RIP: 0033:0x7f3c5be37287
<4>[  536.789871] RSP: 002b:00007ffdf2f1f208 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
<4>[  536.789896] RAX: ffffffffffffffda RBX: ffffffff81493a03 RCX: 00007f3c5be37287
<4>[  536.789917] RDX: 0000000000000001 RSI: 0000000000000800 RDI: 00005641d2708e48
<4>[  536.789938] RBP: ffffc9000054ff88 R08: 0000000000000000 R09: 0000000000000080
<4>[  536.789958] R10: 00007f3c5d91f8c0 R11: 0000000000000246 R12: 0000000000000000
<4>[  536.789978] R13: 00007ffdf2f1f3f0 R14: 0000000000000000 R15: 0000000000000000
<4>[  536.790003]  ? __this_cpu_preempt_check+0x13/0x20
<4>[  536.790024] Code: 00 00 5b 5d c3 e8 93 b9 3a 00 85 c0 74 ec 8b 05 e1 53 c3 01 85 c0 75 e2 48 c7 c6 86 a6 c7 81 48 c7 c7 8b 8d c6 81 e8 03 ae 01 00 <0f> ff eb cb 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 b8 
<4>[  536.790210] ---[ end trace 824a7d8ce15a77b4 ]---
<7>[  536.829138] [drm:gmbus_xfer [i915]] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 1
<7>[  536.843505] [drm:drm_do_probe_ddc_edid] drm: skipping non-existent adapter i915 gmbus dpb
<7>[  536.895952] [drm:gmbus_xfer [i915]] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 1
<7>[  536.910417] [drm:drm_dp_dual_mode_detect] DP dual mode HDMI ID:  (err -6)
<7>[  536.910442] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:62:HDMI-A-1] disconnected
<4>[  536.910472] general protection fault: 0000 [#1] PREEMPT SMP
<4>[  536.910488] Modules linked in: i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm vgem ax88179_178a usbnet x86_pkg_temp_thermal mii intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000e ptp pps_core prime_numbers i2c_hid [last unloaded: snd_hda_intel]
<4>[  536.910601] CPU: 4 PID: 224 Comm: kworker/4:3 Tainted: G     U  W       4.14.0-rc3-CI-CI_DRM_3171+ #1
<4>[  536.910625] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWX1.R00.X104.A03.1709140524 09/14/2017
<4>[  536.910718] Workqueue: events intel_dp_modeset_retry_work_fn [i915]
<4>[  536.910738] task: ffff8804573a8040 task.stack: ffffc9000121c000
<4>[  536.910757] RIP: 0010:drm_setup_crtcs+0x143/0xbf0
<4>[  536.910771] RSP: 0018:ffffc9000121fd20 EFLAGS: 00010202
<4>[  536.910787] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000031 RCX: 0000000000000000
<4>[  536.910806] RDX: 0000000000000003 RSI: 0000000000000780 RDI: ffffc9000121fc98
<4>[  536.910824] RBP: ffffc9000121fdb8 R08: ffff8804533442f0 R09: ffff88044dae8e88
<4>[  536.910842] R10: ffffc9000121fb78 R11: 0000000000000000 R12: 0000000000000004
<4>[  536.910861] R13: ffff88044d8bdfa8 R14: 0000000000000780 R15: 0000000000000438
<4>[  536.910880] FS:  0000000000000000(0000) GS:ffff88045d300000(0000) knlGS:0000000000000000
<4>[  536.910901] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  536.910917] CR2: 00007f0b3594e000 CR3: 0000000456b07004 CR4: 00000000003606e0
<4>[  536.910936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  536.910954] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[  536.910971] Call Trace:
<4>[  536.910990]  drm_fb_helper_hotplug_event.part.18+0x7e/0xc0
<4>[  536.911007]  drm_fb_helper_hotplug_event+0x1a/0x20
<4>[  536.911078]  intel_fbdev_output_poll_changed+0x1a/0x20 [i915]
<4>[  536.911096]  drm_kms_helper_hotplug_event+0x27/0x30
<4>[  536.911162]  intel_dp_modeset_retry_work_fn+0x77/0x80 [i915]
<4>[  536.911181]  process_one_work+0x233/0x660
<4>[  536.911199]  worker_thread+0x206/0x3b0
<4>[  536.911214]  kthread+0x152/0x190
<4>[  536.911226]  ? process_one_work+0x660/0x660
<4>[  536.911240]  ? kthread_create_on_node+0x40/0x40
<4>[  536.911257]  ret_from_fork+0x27/0x40
<4>[  536.911272] Code: 06 00 00 45 8b 45 20 31 db 45 31 e4 45 85 c0 0f 8e 91 06 00 00 44 8b 75 94 44 8b 7d 90 49 8b 45 28 49 63 d4 44 89 f6 41 83 c4 01 <48> 8b 04 d0 44 89 fa 48 8b 38 48 8b 87 a8 01 00 00 ff 50 20 01 
<1>[  536.911432] RIP: drm_setup_crtcs+0x143/0xbf0 RSP: ffffc9000121fd20
<4>[  536.911472] ---[ end trace 824a7d8ce15a77b5 ]---
<2>[  618.454192] softdog: Initiating panic
<0>[  618.454213] Kernel panic - not syncing: Software Watchdog Timer expired
<0>[  618.454308] Kernel Offset: disabled
Comment 6 Maarten Lankhorst 2017-10-27 07:19:32 UTC
The drv_module_reload_basic hangs should be gone now with

commit 886c6b8692ba5f71b578097524b3b082e2e02119 (drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Manasi Navare <manasi.d.navare@intel.com>
Date:   Thu Oct 26 14:52:00 2017 -0700

    drm/i915: Cancel the modeset retry work during modeset cleanup
Comment 7 Jani Saarinen 2017-10-27 10:11:55 UTC
Integrated in https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/commits_short.log
Comment 8 Marta Löfstedt 2017-10-27 12:09:06 UTC
There hasn't been any softdog incomplete on CFL-s since https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3179/fi-cfl-s/igt@drv_module_reload@basic-no-display.html, so I am not sure the patch has anything to do with it. Anyways it is fine with me to close this.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.