Summary: | [BAT][KBL] igt@gem_exec_suspend@basic-s3 - Incomplete - | |
---|---|---|---
Product: | DRI | Reporter: | Marta Löfstedt <marta.lofstedt>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | blocker | |
Priority: | high | CC: | intel-gfx-bugs, jani.saarinen
Version: | DRI git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | ReadyForDev | |
i915 platform: | KBL | i915 features: |
Description
Marta Löfstedt
2017-09-19 06:45:23 UTC
Probably related to bug #102849; the causal relationship is unclear.

Chris Wilson (comment #2): Sometimes we see the following on a prior suspend:

[ 38.681288] ======================================================
[ 38.681288] WARNING: possible circular locking dependency detected
[ 38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted
[ 38.681290] ------------------------------------------------------
[ 38.681290] rtcwake/1414 is trying to acquire lock:
[ 38.681291]  ((complete)&st->done){+.+.}, at: [<ffffffff8190987d>] wait_for_completion+0x1d/0x20
[ 38.681295] but task is already holding lock:
[ 38.681295]  (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[ 38.681298] which lock already depends on the new lock.
[ 38.681298] the existing dependency chain (in reverse order) is:
[ 38.681299] -> #1 (sparse_irq_lock){+.+.}:
[ 38.681301]        __mutex_lock+0x86/0x9b0
[ 38.681302]        mutex_lock_nested+0x1b/0x20
[ 38.681303]        irq_lock_sparse+0x17/0x20
[ 38.681304]        irq_affinity_online_cpu+0x18/0xd0
[ 38.681305]        cpuhp_invoke_callback+0xa3/0x840
[ 38.681306] -> #0 ((complete)&st->done){+.+.}:
[ 38.681308]        check_prev_add+0x430/0x840
[ 38.681309]        __lock_acquire+0x1420/0x15e0
[ 38.681310]        lock_acquire+0xb0/0x200
[ 38.681311]        wait_for_common+0x58/0x210
[ 38.681311]        wait_for_completion+0x1d/0x20
[ 38.681312]        takedown_cpu+0x89/0xf0
[ 38.681313]        cpuhp_invoke_callback+0xa3/0x840
[ 38.681314]        cpuhp_down_callbacks+0x42/0x80
[ 38.681314]        _cpu_down+0xb9/0xf0
[ 38.681315]        freeze_secondary_cpus+0xa3/0x390
[ 38.681316]        hibernation_snapshot+0x24c/0x5f0
[ 38.681317]        hibernate+0x14f/0x2b1
[ 38.681318]        state_store+0xe5/0xf0
[ 38.681319]        kobj_attr_store+0xf/0x20
[ 38.681321]        sysfs_kf_write+0x45/0x60
[ 38.681322]        kernfs_fop_write+0x124/0x1c0
[ 38.681323]        __vfs_write+0x28/0x130
[ 38.681324]        vfs_write+0xcb/0x1c0
[ 38.681324]        SyS_write+0x49/0xb0
[ 38.681326]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 38.681326] other info that might help us debug this:
[ 38.681326]  Possible unsafe locking scenario:
[ 38.681326]        CPU0                    CPU1
[ 38.681327]        ----                    ----
[ 38.681327]   lock(sparse_irq_lock);
[ 38.681328]                                lock((complete)&st->done);
[ 38.681328]                                lock(sparse_irq_lock);
[ 38.681329]   lock((complete)&st->done);
[ 38.681330]  *** DEADLOCK ***
[ 38.681330] 9 locks held by rtcwake/1414:
[ 38.681331]  #0: (sb_writers#5){.+.+}, at: [<ffffffff81220161>] vfs_write+0x171/0x1c0
[ 38.681333]  #1: (&of->mutex){+.+.}, at: [<ffffffff812a3302>] kernfs_fop_write+0xf2/0x1c0
[ 38.681335]  #2: (kn->count#206){.+.+}, at: [<ffffffff812a330b>] kernfs_fop_write+0xfb/0x1c0
[ 38.681337]  #3: (pm_mutex){+.+.}, at: [<ffffffff810e79b9>] hibernate+0x59/0x2b1
[ 38.681339]  #4: (device_hotplug_lock){+.+.}, at: [<ffffffff81617ff7>] lock_device_hotplug+0x17/0x20
[ 38.681342]  #5: (acpi_scan_lock){+.+.}, at: [<ffffffff8153b3c7>] acpi_scan_lock_acquire+0x17/0x20
[ 38.681345]  #6: (cpu_add_remove_lock){+.+.}, at: [<ffffffff8108106e>] freeze_secondary_cpus+0x2e/0x390
[ 38.681347]  #7: (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d660b>] percpu_down_write+0x2b/0x110
[ 38.681349]  #8: (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[ 38.681351] stack backtrace:
[ 38.681353] CPU: 2 PID: 1414 Comm: rtcwake Not tainted 4.14.0-rc1-CI-CI_DRM_3099+ #1
[ 38.681353] Hardware name: /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
[ 38.681354] Call Trace:
[ 38.681355]  dump_stack+0x68/0x9f
[ 38.681357]  print_circular_bug+0x235/0x3c0
[ 38.681358]  ? lockdep_init_map_crosslock+0x20/0x20
[ 38.681359]  check_prev_add+0x430/0x840
[ 38.681361]  __lock_acquire+0x1420/0x15e0
[ 38.681362]  ? __lock_acquire+0x1420/0x15e0
[ 38.681363]  ? lockdep_init_map_crosslock+0x20/0x20
[ 38.681364]  lock_acquire+0xb0/0x200
[ 38.681365]  ? wait_for_completion+0x1d/0x20
[ 38.681366]  wait_for_common+0x58/0x210
[ 38.681367]  ? wait_for_completion+0x1d/0x20
[ 38.681368]  ? cpuhp_invoke_callback+0x840/0x840
[ 38.681370]  ? stop_machine_cpuslocked+0xc1/0xd0
[ 38.681370]  ? cpuhp_invoke_callback+0x840/0x840
[ 38.681371]  wait_for_completion+0x1d/0x20
[ 38.681372]  takedown_cpu+0x89/0xf0
[ 38.681373]  ? cpuhp_complete_idle_dead+0x20/0x20
[ 38.681374]  cpuhp_invoke_callback+0xa3/0x840
[ 38.681375]  cpuhp_down_callbacks+0x42/0x80
[ 38.681376]  _cpu_down+0xb9/0xf0
[ 38.681377]  freeze_secondary_cpus+0xa3/0x390
[ 38.681378]  hibernation_snapshot+0x24c/0x5f0
[ 38.681379]  hibernate+0x14f/0x2b1
[ 38.681380]  state_store+0xe5/0xf0
[ 38.681381]  kobj_attr_store+0xf/0x20
[ 38.681383]  sysfs_kf_write+0x45/0x60
[ 38.681384]  kernfs_fop_write+0x124/0x1c0
[ 38.681385]  __vfs_write+0x28/0x130
[ 38.681386]  ? rcu_read_lock_sched_held+0x7a/0x90
[ 38.681387]  ? rcu_sync_lockdep_assert+0x2f/0x60
[ 38.681388]  ? __sb_start_write+0x108/0x200
[ 38.681389]  vfs_write+0xcb/0x1c0
[ 38.681390]  SyS_write+0x49/0xb0
[ 38.681391]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 38.681392] RIP: 0033:0x7f5ed20048f0
[ 38.681393] RSP: 002b:00007ffc6aa4ba98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 38.681394] RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007f5ed20048f0
[ 38.681394] RDX: 0000000000000005 RSI: 00005572b5c20060 RDI: 0000000000000007
[ 38.681395] RBP: ffffc9000058bf88 R08: 00005572b5c1ddc0 R09: 00007f5ed24d6700
[ 38.681395] R10: 00007f5ed22cdb58 R11: 0000000000000246 R12: 00005572b5c1dce0
[ 38.681396] R13: 0000000000000001 R14: 0000000000000005 R15: 0000000000000005
[ 38.681398]  ? __this_cpu_preempt_check+0x13/0x20

(In reply to Chris Wilson from comment #2)
> Sometimes we see
>
> [ 38.681288] ======================================================
> [ 38.681288] WARNING: possible circular locking dependency detected
> [ 38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted

The lockdep splat is present on all runs so far with the 4.14 kernel, see bug 102505.
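For readers not used to this kind of lockdep report, the chain above roughly says: the CPU-down path (takedown_cpu) blocks on the hotplug completion while holding sparse_irq_lock, and lockdep has seen the hotplug callbacks that would eventually signal that completion (irq_affinity_online_cpu) take sparse_irq_lock themselves, so the wait could never end. The sketch below is a hypothetical userland analogue using pthreads, not the kernel code; every identifier in it is illustrative only, and it deadlocks by design to show the inversion. Build with `cc -pthread`.

```c
/*
 * Hypothetical analogue of the inversion lockdep reports above:
 * one thread holds a lock and then blocks on a "completion", while the
 * thread that would signal the completion first needs that same lock.
 * Illustration only; deadlocks on purpose.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t sparse_irq_lock_analogue = PTHREAD_MUTEX_INITIALIZER;

/* A minimal "completion": a flag guarded by its own mutex and condvar. */
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;
static int done;

static void complete(void)
{
	pthread_mutex_lock(&done_lock);
	done = 1;
	pthread_cond_signal(&done_cond);
	pthread_mutex_unlock(&done_lock);
}

static void wait_for_completion(void)
{
	pthread_mutex_lock(&done_lock);
	while (!done)
		pthread_cond_wait(&done_cond, &done_lock);
	pthread_mutex_unlock(&done_lock);
}

/* Analogue of the CPU-down path: waits for the other side while holding the lock. */
static void *cpu_down_path(void *arg)
{
	pthread_mutex_lock(&sparse_irq_lock_analogue);
	wait_for_completion();   /* blocks forever: the completer needs the lock */
	pthread_mutex_unlock(&sparse_irq_lock_analogue);
	return NULL;
}

/* Analogue of the callback that must take the lock before it can complete. */
static void *hotplug_callback_path(void *arg)
{
	sleep(1);                /* let the other thread take the lock first */
	pthread_mutex_lock(&sparse_irq_lock_analogue);
	complete();
	pthread_mutex_unlock(&sparse_irq_lock_analogue);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	puts("starting both threads; expect a deadlock (Ctrl-C to exit)");
	pthread_create(&a, NULL, cpu_down_path, NULL);
	pthread_create(&b, NULL, hotplug_callback_path, NULL);
	pthread_join(a, NULL);   /* never returns: the two threads deadlock */
	pthread_join(b, NULL);
	return 0;
}
```

The general shape of the fix for this class of report is to drop the lock before blocking on the completion, or to signal the completion from a context that never needs that lock; which of these the kernel ultimately chose is not covered by this bug.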
Incomplete reproduced on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html

Note that just before, there was:

igt@gem_exec_store@basic-all
[ 230.632776] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
[ 230.699470] Setting dangerous option reset - tainting kernel
igt@gem_exec_store@basic-default
[ 232.276598] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
[ 232.437161] Setting dangerous option reset - tainting kernel

This pattern is present in most of the incompletes starting from CI_DRM_3099, but not in:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3113/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html

History from:
http://benchsrv.fi.intel.com/cibuglog/?action_failures_history=287

Raising priority, as this is causing hard hangs and unreliable CI data.

Faced this issue with gem_mmap_gtt and gem_tiled_blits with a custom patchset:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5800/fi-kbl-7500u/igt@gem_mmap_gtt@basic-write-read.html
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5800/fi-kbl-7500u/igt@gem_tiled_blits@basic.html

Here is a new incomplete hang without any "render: timed out waiting for forcewake ack request" in dmesg:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3125/fi-kbl-7500u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

*** Bug 103039 has been marked as a duplicate of this bug. ***

There was also a dmesg failure on the same test:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html

[ 433.598425] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[ 443.774330] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
[ 454.014178] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
[ 464.254342] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out

And from S4 too:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html

[ 309.567128] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[ 319.870125] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
[ 330.110396] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
[ 340.350265] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out
<3>[ 727.934290] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
<2>[ 738.088137] softdog: Initiating panic
<0>[ 738.088190] Kernel panic - not syncing: Software Watchdog Timer expired
<4>[ 738.088290] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G U 4.14.0-rc2-CI-CI_DRM_3158+ #1
<4>[ 738.088399] Hardware name: Dell Inc. XPS 13 9360/093TW6, BIOS 1.3.2 01/18/2017
<4>[ 738.088475] Call Trace:
<4>[ 738.088525]  <IRQ>
<4>[ 738.088561]  dump_stack+0x68/0x9f
<4>[ 738.088627]  panic+0xd4/0x21d
<4>[ 738.088675]  ? printk+0x43/0x4b
<4>[ 738.088741]  softdog_fire+0x6b/0x70
<4>[ 738.088792]  __hrtimer_run_queues+0xed/0x4d0
<4>[ 738.088855]  ? softdog_ping+0x50/0x50
<4>[ 738.088915]  hrtimer_interrupt+0xc1/0x220
<4>[ 738.088982]  smp_apic_timer_interrupt+0x7d/0x2e0
<4>[ 738.089042]  apic_timer_interrupt+0x9a/0xa0
<4>[ 738.089098]  </IRQ>
<4>[ 738.089151] RIP: 0010:cpuidle_enter_state+0x136/0x370
<4>[ 738.089205] RSP: 0018:ffffc900000c3e80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
<4>[ 738.089306] RAX: ffff880274b62a40 RBX: 00000000000029f7 RCX: 0000000000000001
<4>[ 738.089400] RDX: 0000000000000000 RSI: ffffffff81d0e06c RDI: ffffffff81cc1ea6
<4>[ 738.089474] RBP: ffffc900000c3eb8 R08: 00000000ffffffff R09: 0000000000000008
<4>[ 738.089550] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
<4>[ 738.089643] R13: 0000000000000001 R14: ffff88027e5a58a0 R15: 000000abd97ad493
<4>[ 738.089743]  cpuidle_enter+0x17/0x20
<4>[ 738.089793]  call_cpuidle+0x23/0x40
<4>[ 738.089846]  do_idle+0x192/0x1e0
<4>[ 738.089908]  cpu_startup_entry+0x1d/0x20
<4>[ 738.089957]  start_secondary+0x11c/0x140
<4>[ 738.090028]  secondary_startup_64+0xa5/0xa5
<0>[ 738.090183] Kernel Offset: disabled
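The "softdog: Initiating panic" / "Software Watchdog Timer expired" lines explain why these runs end as Incomplete rather than as an ordinary failure: the CI machine keeps a software watchdog armed, and when the box hard hangs the pings stop and the watchdog panics it. The snippet below is a minimal sketch of that arrangement, assuming the standard /dev/watchdog chardev interface; the device path and the 5-second ping interval are assumptions, not the CI farm's actual configuration.

```c
/* Minimal watchdog "feeder": keep pinging /dev/watchdog so the box stays up.
 * If the machine hard hangs and the pings stop, the (soft) watchdog fires,
 * as seen in the panic trace above. Needs root and a loaded watchdog driver. */
#include <fcntl.h>
#include <linux/watchdog.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/watchdog", O_WRONLY);
	if (fd < 0) {
		perror("/dev/watchdog");
		return 1;
	}

	int timeout = 0;
	ioctl(fd, WDIOC_GETTIMEOUT, &timeout);      /* report the configured timeout */
	printf("watchdog timeout: %d s\n", timeout);

	for (;;) {
		ioctl(fd, WDIOC_KEEPALIVE, 0);      /* ping; if the pings stop, the box panics/reboots */
		sleep(5);
	}
	/* Note: on many setups simply exiting without disarming the watchdog
	 * will itself trigger it once the timeout expires. */
}
```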
(In reply to Jani Saarinen from comment #10)
> And from S4 too:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/
> igt@gem_exec_suspend@basic-s4-devices.html

Jani, I don't think this is a duplicate. This bug is about the HARD HANGS we saw on fi-kbl-7500u after 4.14.0-rc1; there is also a possible dependency, in that bug 102849 happens before this one. Bug 103039 is not a hard hang: it has a pstore. Since this hard-hang issue on fi-kbl-7560u, and the potentially related bug 102849, have not been reproduced since 4.14.0-rc2, my plan was to close these after discussing with Mika Kuoppala.
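For reference, the suspend path these tests exercise boils down to arming an RTC wake alarm and writing a state to /sys/power/state, which is roughly what rtcwake does on behalf of igt@gem_exec_suspend@basic-s3 (it is the rtcwake/1414 task in the lockdep splat above). The sketch below is a hypothetical reproduction helper, not IGT's own code; the rtc0 path and the "+15" relative-alarm form are assumptions about the test machine, and it must be run as root.

```c
/* Sketch of an S3 suspend/resume cycle with an RTC wakeup, similar in spirit
 * to what rtcwake does for the failing test. Paths and the relative "+15"
 * alarm syntax are assumptions about the target machine. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror(path);
		return -1;
	}
	ssize_t ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* Clear any stale alarm, then arm one 15 seconds from now. */
	write_str("/sys/class/rtc/rtc0/wakealarm", "0");
	if (write_str("/sys/class/rtc/rtc0/wakealarm", "+15"))
		return 1;

	/* Request suspend-to-RAM (S3); this write blocks until resume. */
	if (write_str("/sys/power/state", "mem"))
		return 1;

	printf("resumed from S3\n");
	return 0;
}
```

On the machines in this bug the interesting part is what happens when the write to /sys/power/state never returns: without a watchdog the run simply stalls, which is why the CI result shows up as Incomplete.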