Summary: | [BAT] igt@kms_chamelium@common-hpd-after-suspend caused *ERROR* failed to enable link training | ||
---|---|---|---|
Product: | DRI | Reporter: | Jani Saarinen <jani.saarinen> |
Component: | IGT | Assignee: | Petri Latvala <petri.latvala> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | critical | ||
Priority: | high | CC: | elongbug, intel-gfx-bugs, petri.latvala |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | KBL | i915 features: | power/suspend-resume |
Description
Jani Saarinen
2017-09-01 06:25:44 UTC
Also on: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3116/fi-kbl-7500u/igt@chamelium@dp-crc-fast.html Note on 4.14.0-rc1 kernels, i.e. CI_DRM_3099 we are now seeing a lock-dep before the link training issue: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3099/fi-kbl-7500u/igt@chamelium@common-hpd-after-suspend.html [ 70.297898] ====================================================== [ 70.297898] WARNING: possible circular locking dependency detected [ 70.297900] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted [ 70.297901] ------------------------------------------------------ [ 70.297901] rtcwake/1459 is trying to acquire lock: [ 70.297902] ((complete)&st->done){+.+.}, at: [<ffffffff8190987d>] wait_for_completion+0x1d/0x20 [ 70.297907] but task is already holding lock: [ 70.297908] (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20 [ 70.297912] which lock already depends on the new lock. [ 70.297912] the existing dependency chain (in reverse order) is: [ 70.297913] -> #1 (sparse_irq_lock){+.+.}: [ 70.297916] __mutex_lock+0x86/0x9b0 [ 70.297918] mutex_lock_nested+0x1b/0x20 [ 70.297920] irq_lock_sparse+0x17/0x20 [ 70.297921] irq_affinity_online_cpu+0x18/0xd0 [ 70.297923] cpuhp_invoke_callback+0xa3/0x840 [ 70.297924] -> #0 ((complete)&st->done){+.+.}: [ 70.297927] check_prev_add+0x430/0x840 [ 70.297929] __lock_acquire+0x1420/0x15e0 [ 70.297930] lock_acquire+0xb0/0x200 [ 70.297932] wait_for_common+0x58/0x210 [ 70.297933] wait_for_completion+0x1d/0x20 [ 70.297934] takedown_cpu+0x89/0xf0 [ 70.297935] cpuhp_invoke_callback+0xa3/0x840 [ 70.297937] cpuhp_down_callbacks+0x42/0x80 [ 70.297937] _cpu_down+0xb9/0xf0 [ 70.297939] freeze_secondary_cpus+0xa3/0x390 [ 70.297940] suspend_devices_and_enter+0x2fd/0xce0 [ 70.297942] pm_suspend+0x4f0/0x9d0 [ 70.297943] state_store+0x82/0xf0 [ 70.297945] kobj_attr_store+0xf/0x20 [ 70.297947] sysfs_kf_write+0x45/0x60 [ 70.297949] kernfs_fop_write+0x124/0x1c0 [ 70.297950] __vfs_write+0x28/0x130 [ 70.297951] vfs_write+0xcb/0x1c0 [ 70.297953] SyS_write+0x49/0xb0 [ 70.297955] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 70.297955] other info that might help us debug this: [ 70.297956] Possible unsafe locking scenario: [ 70.297956] CPU0 CPU1 [ 70.297956] ---- ---- [ 70.297957] lock(sparse_irq_lock); [ 70.297958] lock((complete)&st->done); [ 70.297959] lock(sparse_irq_lock); [ 70.297960] lock((complete)&st->done); [ 70.297961] *** DEADLOCK *** [ 70.297963] 8 locks held by rtcwake/1459: [ 70.297963] #0: (sb_writers#5){.+.+}, at: [<ffffffff81220161>] vfs_write+0x171/0x1c0 [ 70.297966] #1: (&of->mutex){+.+.}, at: [<ffffffff812a3302>] kernfs_fop_write+0xf2/0x1c0 [ 70.297970] #2: (kn->count#189){.+.+}, at: [<ffffffff812a330b>] kernfs_fop_write+0xfb/0x1c0 [ 70.297973] #3: (pm_mutex){+.+.}, at: [<ffffffff810e5f49>] pm_suspend+0xa9/0x9d0 [ 70.297976] #4: (acpi_scan_lock){+.+.}, at: [<ffffffff8153b3c7>] acpi_scan_lock_acquire+0x17/0x20 [ 70.297980] #5: (cpu_add_remove_lock){+.+.}, at: [<ffffffff8108106e>] freeze_secondary_cpus+0x2e/0x390 [ 70.297983] #6: (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d660b>] percpu_down_write+0x2b/0x110 [ 70.297986] #7: (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20 [ 70.297990] stack backtrace: [ 70.297992] CPU: 2 PID: 1459 Comm: rtcwake Not tainted 4.14.0-rc1-CI-CI_DRM_3099+ #1 [ 70.297993] Hardware name: GIGABYTE GB-BKi7(H)A-7500/MFLP7AP-00, BIOS F4 02/20/2017 [ 70.297993] Call Trace: [ 70.297996] dump_stack+0x68/0x9f [ 70.297998] print_circular_bug+0x235/0x3c0 [ 70.298000] ? lockdep_init_map_crosslock+0x20/0x20 [ 70.298002] check_prev_add+0x430/0x840 [ 70.298004] __lock_acquire+0x1420/0x15e0 [ 70.298006] ? __lock_acquire+0x1420/0x15e0 [ 70.298008] ? lockdep_init_map_crosslock+0x20/0x20 [ 70.298010] lock_acquire+0xb0/0x200 [ 70.298011] ? wait_for_completion+0x1d/0x20 [ 70.298013] wait_for_common+0x58/0x210 [ 70.298014] ? wait_for_completion+0x1d/0x20 [ 70.298015] ? cpuhp_invoke_callback+0x840/0x840 [ 70.298018] ? stop_machine_cpuslocked+0xc1/0xd0 [ 70.298020] ? cpuhp_invoke_callback+0x840/0x840 [ 70.298021] wait_for_completion+0x1d/0x20 [ 70.298022] takedown_cpu+0x89/0xf0 [ 70.298024] ? cpuhp_complete_idle_dead+0x20/0x20 [ 70.298025] cpuhp_invoke_callback+0xa3/0x840 [ 70.298027] cpuhp_down_callbacks+0x42/0x80 [ 70.298028] _cpu_down+0xb9/0xf0 [ 70.298030] freeze_secondary_cpus+0xa3/0x390 [ 70.298032] suspend_devices_and_enter+0x2fd/0xce0 [ 70.298034] pm_suspend+0x4f0/0x9d0 [ 70.298036] state_store+0x82/0xf0 [ 70.298038] kobj_attr_store+0xf/0x20 [ 70.298040] sysfs_kf_write+0x45/0x60 [ 70.298042] kernfs_fop_write+0x124/0x1c0 [ 70.298043] __vfs_write+0x28/0x130 [ 70.298046] ? rcu_read_lock_sched_held+0x7a/0x90 [ 70.298047] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 70.298049] ? __sb_start_write+0x108/0x200 [ 70.298050] vfs_write+0xcb/0x1c0 [ 70.298052] SyS_write+0x49/0xb0 [ 70.298054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 70.298055] RIP: 0033:0x7f134a31a290 [ 70.298056] RSP: 002b:00007ffd08e34318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 70.298058] RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007f134a31a290 [ 70.298059] RDX: 0000000000000004 RSI: 0000000002141060 RDI: 0000000000000006 [ 70.298060] RBP: ffffc9000028ff88 R08: 000000000213edc0 R09: 00007f134a7f6700 [ 70.298061] R10: 0000000000000003 R11: 0000000000000246 R12: 000000000213ece0 [ 70.298061] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000004 [ 70.298065] ? __this_cpu_preempt_check+0x13/0x20 [ 70.312646] IRQ 121: no longer affine to CPU2 [ 70.312649] IRQ 122: no longer affine to CPU2 [ 70.312651] IRQ 124: no longer affine to CPU2 [ 70.312658] IRQ 127: no longer affine to CPU2 [ 70.312666] IRQ 130: no longer affine to CPU2 [ 70.319706] IRQ 8: no longer affine to CPU3 [ 70.319713] IRQ 9: no longer affine to CPU3 [ 70.319717] IRQ 120: no longer affine to CPU3 [ 70.319723] IRQ 125: no longer affine to CPU3 [ 70.325294] cache: parent cpu1 should not be sleeping [ 70.326538] cache: parent cpu2 should not be sleeping [ 70.327693] cache: parent cpu3 should not be sleeping [ 70.453203] HDA: we are doing full chip reset now [ 74.251209] Suspending console(s) (use no_console_suspend to debug) [ 74.474977] cache: parent cpu1 should not be sleeping [ 74.476247] cache: parent cpu2 should not be sleeping [ 74.477404] cache: parent cpu3 should not be sleeping [ 74.601393] HDA: we are doing full chip reset now [ 76.606301] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training Note the lockdep was solved from 4.14.0-rc3 Note from CI_DRM_3369 then chamelium test has changed name to kms_chamelium. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3369/fi-kbl-7500u/igt@kms_chamelium@common-hpd-after-suspend.html For Maarten to investigate https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3621/fi-kbl-7500u/igt@kms_chamelium@dp-edid-read.html [ 243.927439] [drm:lspcon_wait_mode [i915]] *ERROR* LSPCON mode hasn't settled NOTE I have separated this bug from the current KBL-shard Linktraining bug 103558 and LSPCON bug 103313. First of all. Sorry about spam. This is mass update for our bugs. Sorry if you feel this annoying but with this trying to understand if bug still valid or not. If bug investigation still in progress, please ignore this and I apologize! If you think this is not anymore valid, please comment to the bug that can be closed. If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug. GG, any updates here? I got feedback from Imre. To land this patch to upstream I have to add a specific a "Work Around Number" to commit message. So, I requested adding a "Work Around Number" for this issue to BSpec site. And I am still waiting for an action of HW Team. A CI Bug Log filter associated to this bug has been updated: {- KBL Chamelium: *ERROR* failed to enable link training -} {+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@common-hpd-after-suspend.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@dp-hpd-after-suspend.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@hdmi-hpd-after-suspend.html [code flow of IGT common-hpd-after-suspend] igt-gpu-tools_FDO/tests/kms_chamelium.c common-hpd-after-suspend: test_suspend_resume_hpd_common(..., SUSPEND_STATE_MEM, SUSPEND_TEST_NONE); 1) make all of ports disconnect: reset_state(data, NULL); 2) /* Make sure we notice new connectors after resuming */ try_suspend_resume_hpd(data, NULL, state, test, mon, false); 3) /* Now make sure we notice disconnected connectors after resuming */ try_suspend_resume_hpd(data, NULL, state, test, mon, true); [Connector State] 1) Make an initial state with Disconnected DP: Disconnected HDMI: Disconnected 2) Suspend Resume [scheduled chamelium connector enable: igt_get_autoresume_delay(state) * 1000 / 2: 15 x 1000 / 2 => 7500ms After 7500ms DP : Connected HDMI: Connected If "Hotplug event" will not come after resuming of PM or if the connector state is not "connected" then IGT will complain "Fail". 3) Suspend Resume [scheduled chamelium connector disable: igt_get_autoresume_delay(state) * 1000 / 2: 15 x 1000 / 2 => 7500ms After 7500ms DP : Disconnected HDMI: Disconnected If "Hotplug event" will not come after resuming of PM or if the connector state is not "disconnected" then IGT will complain "Fail". ------------------------------------------------------------------------------- [Below shows [IGT or PM or Link Training or ERROR] searched dmesg.] 6> [96.975976] [IGT] kms_chamelium: executing <6> [97.139345] [IGT] kms_chamelium: starting subtest common-hpd-after-suspend <6> [98.162805] PM: suspend entry (s2idle) <7> [109.379231] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training Passed at Link Rate = 162000, Lane count = 4 <6> [110.511525] PM: suspend exit <6> [110.702260] PM: suspend entry (s2idle) While PM mode is going to handle suspend state, i915 received hpd irq ( dp connector is disconnected ) Even PM is on handling of suspend mode, i915 is tried to handle hpd event. <3> [125.489579] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training <7> [125.489625] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training failed at link rate = 162000, lane count = 4 <6> [127.281947] PM: suspend exit <6> [127.292540] [IGT] kms_chamelium: exiting, ret=0 <7> [128.294484] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training Passed at Link Rate = 162000, Lane count = 4 According to Imre. "...the same applies here as commented elsewhere: the kms_chamelium@*hpd-after-suspend tests are expected to generate the "failed to enable link training" dmesg error. A CI Bug Log filter associated to this bug has been updated: {- CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training -} {+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +} No new failures caught with the new filter Re-opening the bug, but moving it to IGT / Petri as we need the runner to filter out this message! A CI Bug Log filter associated to this bug has been updated: {- CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training -} {+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6649/fi-icl-u4/igt@kms_chamelium@hdmi-edid-read.html Bug assessment: the message continues to happen in almost every run (83% repro rate), but it seems from the comments that there is no but only an "expected" error message. Leaving severity/priority as is in case Petri Latvala gets to it, otherwise Petri please adjust severity/priority. CC'ing. -- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/4. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.