Bug 102850

Summary: [BAT][KBL] igt@gem_exec_suspend@basic-s3 - Incomplete -
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: high CC: intel-gfx-bugs, jani.saarinen
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: KBL i915 features:

Comment 1 Chris Wilson 2017-09-19 08:19:28 UTC
Probably related to #102849; causal relationship unclear.
Comment 2 Chris Wilson 2017-09-19 08:30:18 UTC
Sometimes we see

[   38.681288] ======================================================
[   38.681288] WARNING: possible circular locking dependency detected
[   38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted
[   38.681290] ------------------------------------------------------
[   38.681290] rtcwake/1414 is trying to acquire lock:
[   38.681291]  ((complete)&st->done){+.+.}, at: [<ffffffff8190987d>] wait_for_completion+0x1d/0x20
[   38.681295]
               but task is already holding lock:
[   38.681295]  (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[   38.681298]
               which lock already depends on the new lock.

[   38.681298]
               the existing dependency chain (in reverse order) is:
[   38.681299]
               -> #1 (sparse_irq_lock){+.+.}:
[   38.681301]        __mutex_lock+0x86/0x9b0
[   38.681302]        mutex_lock_nested+0x1b/0x20
[   38.681303]        irq_lock_sparse+0x17/0x20
[   38.681304]        irq_affinity_online_cpu+0x18/0xd0
[   38.681305]        cpuhp_invoke_callback+0xa3/0x840
[   38.681306]
               -> #0 ((complete)&st->done){+.+.}:
[   38.681308]        check_prev_add+0x430/0x840
[   38.681309]        __lock_acquire+0x1420/0x15e0
[   38.681310]        lock_acquire+0xb0/0x200
[   38.681311]        wait_for_common+0x58/0x210
[   38.681311]        wait_for_completion+0x1d/0x20
[   38.681312]        takedown_cpu+0x89/0xf0
[   38.681313]        cpuhp_invoke_callback+0xa3/0x840
[   38.681314]        cpuhp_down_callbacks+0x42/0x80
[   38.681314]        _cpu_down+0xb9/0xf0
[   38.681315]        freeze_secondary_cpus+0xa3/0x390
[   38.681316]        hibernation_snapshot+0x24c/0x5f0
[   38.681317]        hibernate+0x14f/0x2b1
[   38.681318]        state_store+0xe5/0xf0
[   38.681319]        kobj_attr_store+0xf/0x20
[   38.681321]        sysfs_kf_write+0x45/0x60
[   38.681322]        kernfs_fop_write+0x124/0x1c0
[   38.681323]        __vfs_write+0x28/0x130
[   38.681324]        vfs_write+0xcb/0x1c0
[   38.681324]        SyS_write+0x49/0xb0
[   38.681326]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[   38.681326]
               other info that might help us debug this:

[   38.681326]  Possible unsafe locking scenario:

[   38.681326]        CPU0                    CPU1
[   38.681327]        ----                    ----
[   38.681327]   lock(sparse_irq_lock);
[   38.681328]                                lock((complete)&st->done);
[   38.681328]                                lock(sparse_irq_lock);
[   38.681329]   lock((complete)&st->done);
[   38.681330]
                *** DEADLOCK ***

[   38.681330] 9 locks held by rtcwake/1414:
[   38.681331]  #0:  (sb_writers#5){.+.+}, at: [<ffffffff81220161>] vfs_write+0x171/0x1c0
[   38.681333]  #1:  (&of->mutex){+.+.}, at: [<ffffffff812a3302>] kernfs_fop_write+0xf2/0x1c0
[   38.681335]  #2:  (kn->count#206){.+.+}, at: [<ffffffff812a330b>] kernfs_fop_write+0xfb/0x1c0
[   38.681337]  #3:  (pm_mutex){+.+.}, at: [<ffffffff810e79b9>] hibernate+0x59/0x2b1
[   38.681339]  #4:  (device_hotplug_lock){+.+.}, at: [<ffffffff81617ff7>] lock_device_hotplug+0x17/0x20
[   38.681342]  #5:  (acpi_scan_lock){+.+.}, at: [<ffffffff8153b3c7>] acpi_scan_lock_acquire+0x17/0x20
[   38.681345]  #6:  (cpu_add_remove_lock){+.+.}, at: [<ffffffff8108106e>] freeze_secondary_cpus+0x2e/0x390
[   38.681347]  #7:  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d660b>] percpu_down_write+0x2b/0x110
[   38.681349]  #8:  (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[   38.681351]
               stack backtrace:
[   38.681353] CPU: 2 PID: 1414 Comm: rtcwake Not tainted 4.14.0-rc1-CI-CI_DRM_3099+ #1
[   38.681353] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
[   38.681354] Call Trace:
[   38.681355]  dump_stack+0x68/0x9f
[   38.681357]  print_circular_bug+0x235/0x3c0
[   38.681358]  ? lockdep_init_map_crosslock+0x20/0x20
[   38.681359]  check_prev_add+0x430/0x840
[   38.681361]  __lock_acquire+0x1420/0x15e0
[   38.681362]  ? __lock_acquire+0x1420/0x15e0
[   38.681363]  ? lockdep_init_map_crosslock+0x20/0x20
[   38.681364]  lock_acquire+0xb0/0x200
[   38.681365]  ? wait_for_completion+0x1d/0x20
[   38.681366]  wait_for_common+0x58/0x210
[   38.681367]  ? wait_for_completion+0x1d/0x20
[   38.681368]  ? cpuhp_invoke_callback+0x840/0x840
[   38.681370]  ? stop_machine_cpuslocked+0xc1/0xd0
[   38.681370]  ? cpuhp_invoke_callback+0x840/0x840
[   38.681371]  wait_for_completion+0x1d/0x20
[   38.681372]  takedown_cpu+0x89/0xf0
[   38.681373]  ? cpuhp_complete_idle_dead+0x20/0x20
[   38.681374]  cpuhp_invoke_callback+0xa3/0x840
[   38.681375]  cpuhp_down_callbacks+0x42/0x80
[   38.681376]  _cpu_down+0xb9/0xf0
[   38.681377]  freeze_secondary_cpus+0xa3/0x390
[   38.681378]  hibernation_snapshot+0x24c/0x5f0
[   38.681379]  hibernate+0x14f/0x2b1
[   38.681380]  state_store+0xe5/0xf0
[   38.681381]  kobj_attr_store+0xf/0x20
[   38.681383]  sysfs_kf_write+0x45/0x60
[   38.681384]  kernfs_fop_write+0x124/0x1c0
[   38.681385]  __vfs_write+0x28/0x130
[   38.681386]  ? rcu_read_lock_sched_held+0x7a/0x90
[   38.681387]  ? rcu_sync_lockdep_assert+0x2f/0x60
[   38.681388]  ? __sb_start_write+0x108/0x200
[   38.681389]  vfs_write+0xcb/0x1c0
[   38.681390]  SyS_write+0x49/0xb0
[   38.681391]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[   38.681392] RIP: 0033:0x7f5ed20048f0
[   38.681393] RSP: 002b:00007ffc6aa4ba98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   38.681394] RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007f5ed20048f0
[   38.681394] RDX: 0000000000000005 RSI: 00005572b5c20060 RDI: 0000000000000007
[   38.681395] RBP: ffffc9000058bf88 R08: 00005572b5c1ddc0 R09: 00007f5ed24d6700
[   38.681395] R10: 00007f5ed22cdb58 R11: 0000000000000246 R12: 00005572b5c1dce0
[   38.681396] R13: 0000000000000001 R14: 0000000000000005 R15: 0000000000000005
[   38.681398]  ? __this_cpu_preempt_check+0x13/0x20

on a prior suspend.
Comment 3 Marta Löfstedt 2017-09-22 07:43:50 UTC
(In reply to Chris Wilson from comment #2)
> Sometimes we see
> 
> [   38.681288] ======================================================
> [   38.681288] WARNING: possible circular locking dependency detected
> [   38.681289] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted

The lock-dep is present on all runs so far with 4.14 kernel, see BUG 102505.
Comment 4 Marta Löfstedt 2017-09-22 07:57:42 UTC
Incomplete reproduced on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html

Note just before there was:
igt@gem_exec_store@basic-all
[  230.632776] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
[  230.699470] Setting dangerous option reset - tainting kernel
igt@gem_exec_store@basic-default
[  232.276598] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
[  232.437161] Setting dangerous option reset - tainting kernel

This pattern is present in most of the incompletes starting from CI_DRM_3099, but not on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3113/fi-kbl-7500u/igt@gem_exec_suspend@basic-s3.html
Comment 5 Jani Saarinen 2017-09-22 10:50:16 UTC
History from: http://benchsrv.fi.intel.com/cibuglog/?action_failures_history=287

Raising priority as causing hard hangs and causing unreliable CI data.
Comment 7 Marta Löfstedt 2017-09-25 06:47:22 UTC
Here is a new incomplete hang hang without any: "render: timed out waiting for forcewake ack request" in dmesg:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3125/fi-kbl-7500u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
Comment 8 Jani Saarinen 2017-10-01 07:47:57 UTC
*** Bug 103039 has been marked as a duplicate of this bug. ***
Comment 9 Jani Saarinen 2017-10-01 08:11:46 UTC
There was also dmesg fail on same test:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html

[  433.598425] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[  443.774330] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
[  454.014178] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
[  464.254342] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out
Comment 10 Jani Saarinen 2017-10-01 08:21:28 UTC
And from S4 too:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/igt@gem_exec_suspend@basic-s4-devices.html

[  309.567128] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
[  319.870125] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
[  330.110396] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:27:plane 1A] flip_done timed out
[  340.350265] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:36:pipe A] flip_done timed out

<3>[  727.934290] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:36:pipe A] flip_done timed out
<2>[  738.088137] softdog: Initiating panic
<0>[  738.088190] Kernel panic - not syncing: Software Watchdog Timer expired
<4>[  738.088290] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G     U          4.14.0-rc2-CI-CI_DRM_3158+ #1
<4>[  738.088399] Hardware name: Dell Inc. XPS 13 9360/093TW6, BIOS 1.3.2 01/18/2017
<4>[  738.088475] Call Trace:
<4>[  738.088525]  <IRQ>
<4>[  738.088561]  dump_stack+0x68/0x9f
<4>[  738.088627]  panic+0xd4/0x21d
<4>[  738.088675]  ? printk+0x43/0x4b
<4>[  738.088741]  softdog_fire+0x6b/0x70
<4>[  738.088792]  __hrtimer_run_queues+0xed/0x4d0
<4>[  738.088855]  ? softdog_ping+0x50/0x50
<4>[  738.088915]  hrtimer_interrupt+0xc1/0x220
<4>[  738.088982]  smp_apic_timer_interrupt+0x7d/0x2e0
<4>[  738.089042]  apic_timer_interrupt+0x9a/0xa0
<4>[  738.089098]  </IRQ>
<4>[  738.089151] RIP: 0010:cpuidle_enter_state+0x136/0x370
<4>[  738.089205] RSP: 0018:ffffc900000c3e80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
<4>[  738.089306] RAX: ffff880274b62a40 RBX: 00000000000029f7 RCX: 0000000000000001
<4>[  738.089400] RDX: 0000000000000000 RSI: ffffffff81d0e06c RDI: ffffffff81cc1ea6
<4>[  738.089474] RBP: ffffc900000c3eb8 R08: 00000000ffffffff R09: 0000000000000008
<4>[  738.089550] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
<4>[  738.089643] R13: 0000000000000001 R14: ffff88027e5a58a0 R15: 000000abd97ad493
<4>[  738.089743]  cpuidle_enter+0x17/0x20
<4>[  738.089793]  call_cpuidle+0x23/0x40
<4>[  738.089846]  do_idle+0x192/0x1e0
<4>[  738.089908]  cpu_startup_entry+0x1d/0x20
<4>[  738.089957]  start_secondary+0x11c/0x140
<4>[  738.090028]  secondary_startup_64+0xa5/0xa5
<0>[  738.090183] Kernel Offset: disabled
Comment 11 Marta Löfstedt 2017-10-02 11:17:34 UTC
(In reply to Jani Saarinen from comment #10)
> And from S4 too:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3158/fi-kbl-7560u/
> igt@gem_exec_suspend@basic-s4-devices.html
> 
 Jani I don't think this is duplicate. This bug is about HARD HANGS we saw on the fi-kbl-7500u after 4.14.0-rc1 there is also a possible dependency that BUG 102849 happens before this.

The BUG 103039 is not a HARD HANG it has pstore.

Since this issue with HARD HANG on the fi-kbl-7560u and the potentially related BUG 102849, has not been reproduced since 4.14.0-rc2 my plan was to close these after discussing with Mika Kuoppala.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.