Bug 103026

Summary: [BAT] igt@gem_exec_suspend@basic-s3 lock dep before *ERROR* failed to enable link training
Product: DRI Reporter: Jani Saarinen <jani.saarinen>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED DUPLICATE QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: highest CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: CFL i915 features: power/suspend-resume

Description Jani Saarinen 2017-09-28 15:04:07 UTC
On CI_DRM_3150 on cfl igt@gem_exec_suspend@basic-s3 gives lock dep error and after that link training failure:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3150/fi-cfl-s/igt@gem_exec_suspend@basic-s3.html

Relation to bug 102505 where similar issues seen.
Comment 1 Jani Saarinen 2017-09-28 15:04:15 UTC
Results for igt@gem_exec_suspend@basic-s3

Result: dmesg-warn


Out	
IGT-Version: 1.19-g2885b10f (x86_64) (Linux: 4.14.0-rc2-CI-CI_DRM_3150+ x86_64)
rtcwake: wakeup from "mem" using /dev/rtc0 at Thu Sep 28 13:52:19 2017
Subtest basic-S3: SUCCESS (2.859s)
Environment	
PIGLIT_SOURCE_DIR="/opt/igt/piglit" PIGLIT_PLATFORM="mixed_glx_egl"

Command	
/opt/igt/libexec/intel-gpu-tools/gem_exec_suspend --run-subtest basic-S3

Dmesg	
[  216.484624] Setting dangerous option reset - tainting kernel
[  217.297601] Suspending console(s) (use no_console_suspend to debug)
[  217.636166] sd 3:0:0:0: [sda] Synchronizing SCSI cache
[  217.640116] sd 3:0:0:0: [sda] Stopping disk
[  218.081622] IRQ fixup: irq 1 move in progress, old vector 49

[  218.082967] ======================================================
[  218.082967] WARNING: possible circular locking dependency detected
[  218.082968] 4.14.0-rc2-CI-CI_DRM_3150+ #1 Tainted: G     U  W      
[  218.082969] ------------------------------------------------------
[  218.082969] rtcwake/3282 is trying to acquire lock:
[  218.082970]  ((complete)&st->done){+.+.}, at: [<ffffffff81909e0d>] wait_for_completion+0x1d/0x20
[  218.082973] 
               but task is already holding lock:
[  218.082974]  (sparse_irq_lock){+.+.}, at: [<ffffffff810f22e7>] irq_lock_sparse+0x17/0x20
[  218.082976] 
               which lock already depends on the new lock.

[  218.082976] 
               the existing dependency chain (in reverse order) is:
[  218.082976] 
               -> #2 (sparse_irq_lock){+.+.}:
[  218.082979]        __lock_acquire+0x1420/0x15e0
[  218.082980]        lock_acquire+0xb0/0x200
[  218.082981]        __mutex_lock+0x86/0x9b0
[  218.082981]        mutex_lock_nested+0x1b/0x20
[  218.082982]        irq_lock_sparse+0x17/0x20
[  218.082983]        irq_affinity_online_cpu+0x18/0xd0
[  218.082984]        cpuhp_invoke_callback+0xa3/0x840
[  218.082985]        cpuhp_up_callbacks+0x36/0xc0
[  218.082986]        cpuhp_thread_fun+0x142/0x150
[  218.082987]        smpboot_thread_fn+0x18a/0x280
[  218.082988]        kthread+0x152/0x190
[  218.082989]        ret_from_fork+0x27/0x40
[  218.082989] 
               -> #1 (cpuhp_state){+.+.}:
[  218.082991]        cpuhp_thread_fun+0x54/0x150
[  218.082991]        smpboot_thread_fn+0x18a/0x280
[  218.082992]        kthread+0x152/0x190
[  218.082993]        ret_from_fork+0x27/0x40
[  218.082993]        0xffffffffffffffff
[  218.082993] 
               -> #0 ((complete)&st->done){+.+.}:
[  218.082995]        check_prev_add+0x430/0x840
[  218.082996]        __lock_acquire+0x1420/0x15e0
[  218.082997]        lock_acquire+0xb0/0x200
[  218.082997]        wait_for_common+0x58/0x210
[  218.082998]        wait_for_completion+0x1d/0x20
[  218.082999]        takedown_cpu+0x89/0xf0
[  218.082999]        cpuhp_invoke_callback+0xa3/0x840
[  218.083000]        cpuhp_down_callbacks+0x42/0x80
[  218.083001]        _cpu_down+0xb9/0xf0
[  218.083001]        freeze_secondary_cpus+0xa3/0x390
[  218.083002]        suspend_devices_and_enter+0x2fd/0xce0
[  218.083003]        pm_suspend+0x4f0/0x9d0
[  218.083003]        state_store+0x82/0xf0
[  218.083004]        kobj_attr_store+0xf/0x20
[  218.083006]        sysfs_kf_write+0x45/0x60
[  218.083007]        kernfs_fop_write+0x124/0x1c0
[  218.083007]        __vfs_write+0x28/0x130
[  218.083008]        vfs_write+0xcb/0x1c0
[  218.083009]        SyS_write+0x49/0xb0
[  218.083010]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  218.083010] 
               other info that might help us debug this:

[  218.083010] Chain exists of:
                 (complete)&st->done --> cpuhp_state --> sparse_irq_lock

[  218.083012]  Possible unsafe locking scenario:

[  218.083012]        CPU0                    CPU1
[  218.083012]        ----                    ----
[  218.083012]   lock(sparse_irq_lock);
[  218.083013]                                lock(cpuhp_state);
[  218.083013]                                lock(sparse_irq_lock);
[  218.083014]   lock((complete)&st->done);
[  218.083015] 
                *** DEADLOCK ***

[  218.083015] 8 locks held by rtcwake/3282:
[  218.083015]  #0:  (sb_writers#5){.+.+}, at: [<ffffffff81220701>] vfs_write+0x171/0x1c0
[  218.083017]  #1:  (&of->mutex){+.+.}, at: [<ffffffff812a39d2>] kernfs_fop_write+0xf2/0x1c0
[  218.083019]  #2:  (kn->count#214){.+.+}, at: [<ffffffff812a39db>] kernfs_fop_write+0xfb/0x1c0
[  218.083021]  #3:  (pm_mutex){+.+.}, at: [<ffffffff810e60a9>] pm_suspend+0xa9/0x9d0
[  218.083023]  #4:  (acpi_scan_lock){+.+.}, at: [<ffffffff8153b8b7>] acpi_scan_lock_acquire+0x17/0x20
[  218.083025]  #5:  (cpu_add_remove_lock){+.+.}, at: [<ffffffff810811ce>] freeze_secondary_cpus+0x2e/0x390
[  218.083027]  #6:  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d676b>] percpu_down_write+0x2b/0x110
[  218.083028]  #7:  (sparse_irq_lock){+.+.}, at: [<ffffffff810f22e7>] irq_lock_sparse+0x17/0x20
[  218.083030] 
               stack backtrace:
[  218.083031] CPU: 5 PID: 3282 Comm: rtcwake Tainted: G     U  W       4.14.0-rc2-CI-CI_DRM_3150+ #1
[  218.083032] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWX1.R00.X104.A03.1709140524 09/14/2017
[  218.083032] Call Trace:
[  218.083034]  dump_stack+0x68/0x9f
[  218.083035]  print_circular_bug+0x235/0x3c0
[  218.083036]  ? lockdep_init_map_crosslock+0x20/0x20
[  218.083037]  check_prev_add+0x430/0x840
[  218.083039]  __lock_acquire+0x1420/0x15e0
[  218.083040]  ? __lock_acquire+0x1420/0x15e0
[  218.083041]  ? lockdep_init_map_crosslock+0x20/0x20
[  218.083042]  lock_acquire+0xb0/0x200
[  218.083042]  ? wait_for_completion+0x1d/0x20
[  218.083043]  wait_for_common+0x58/0x210
[  218.083044]  ? wait_for_completion+0x1d/0x20
[  218.083045]  ? cpuhp_invoke_callback+0x840/0x840
[  218.083046]  ? stop_machine_cpuslocked+0xc1/0xd0
[  218.083047]  ? cpuhp_invoke_callback+0x840/0x840
[  218.083048]  wait_for_completion+0x1d/0x20
[  218.083048]  takedown_cpu+0x89/0xf0
[  218.083049]  ? cpuhp_complete_idle_dead+0x20/0x20
[  218.083050]  cpuhp_invoke_callback+0xa3/0x840
[  218.083051]  cpuhp_down_callbacks+0x42/0x80
[  218.083052]  _cpu_down+0xb9/0xf0
[  218.083053]  freeze_secondary_cpus+0xa3/0x390
[  218.083054]  suspend_devices_and_enter+0x2fd/0xce0
[  218.083055]  pm_suspend+0x4f0/0x9d0
[  218.083056]  state_store+0x82/0xf0
[  218.083057]  kobj_attr_store+0xf/0x20
[  218.083058]  sysfs_kf_write+0x45/0x60
[  218.083059]  kernfs_fop_write+0x124/0x1c0
[  218.083060]  __vfs_write+0x28/0x130
[  218.083061]  ? rcu_read_lock_sched_held+0x7a/0x90
[  218.083062]  ? rcu_sync_lockdep_assert+0x2f/0x60
[  218.083063]  ? __sb_start_write+0x108/0x200
[  218.083063]  vfs_write+0xcb/0x1c0
[  218.083064]  SyS_write+0x49/0xb0
[  218.083065]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  218.083066] RIP: 0033:0x7f1827f6d8f0
[  218.083067] RSP: 002b:00007ffca6397f18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  218.083067] RAX: ffffffffffffffda RBX: ffffffff81493023 RCX: 00007f1827f6d8f0
[  218.083068] RDX: 0000000000000004 RSI: 000055b4720e4060 RDI: 0000000000000007
[  218.083068] RBP: ffffc9000155bf88 R08: 000055b4720e1dc0 R09: 00007f1828441700
[  218.083069] R10: 00007f1828236b58 R11: 0000000000000246 R12: 000055b4720e1ce0
[  218.083069] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000004
[  218.083071]  ? __this_cpu_preempt_check+0x13/0x20
[  218.200997] IRQ 121: no longer affine to CPU8
[  218.233686] IRQ 123: no longer affine to CPU10
[  218.242592] IRQ 1: no longer affine to CPU11
[  218.242599] IRQ 8: no longer affine to CPU11
[  218.242602] IRQ 9: no longer affine to CPU11
[  218.242610] IRQ 120: no longer affine to CPU11
[  218.251026]  cache: parent cpu1 should not be sleeping
[  218.252373]  cache: parent cpu2 should not be sleeping
[  218.253665]  cache: parent cpu3 should not be sleeping
[  218.255003]  cache: parent cpu4 should not be sleeping
[  218.256243]  cache: parent cpu5 should not be sleeping
[  218.257730]  cache: parent cpu6 should not be sleeping
[  218.259156]  cache: parent cpu7 should not be sleeping
[  218.260569]  cache: parent cpu8 should not be sleeping
[  218.262029]  cache: parent cpu9 should not be sleeping
[  218.263492]  cache: parent cpu10 should not be sleeping
[  218.265008]  cache: parent cpu11 should not be sleeping
[  218.446140] sd 3:0:0:0: [sda] Starting disk
[  218.668651] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[  218.684828] atkbd serio0: Failed to deactivate keyboard on isa0060/serio0
[  219.060840] atkbd serio0: Failed to enable keyboard on isa0060/serio0
[  219.354472] Setting dangerous option reset - tainting kernel
[  219.428162] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
Comment 2 Marta Löfstedt 2017-10-02 11:27:26 UTC
The lockdep is gone from:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3159/fi-cfl-s/igt@gem_exec_suspend@basic-s3.html

i.e. first test with 4.14.0-rc3.
However, the link training issue remain.

[  242.461748] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[  242.476990] atkbd serio0: Failed to deactivate keyboard on isa0060/serio0
[  242.853071] atkbd serio0: Failed to enable keyboard on isa0060/serio0
[  243.461670] Setting dangerous option reset - tainting kernel
[  243.546478] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
Comment 3 Marta Löfstedt 2017-10-10 14:11:23 UTC

*** This bug has been marked as a duplicate of bug 103186 ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.