Bug 102505 - [BAT] igt@kms_chamelium@common-hpd-after-suspend caused *ERROR* failed to enable link training
Summary: [BAT] igt@kms_chamelium@common-hpd-after-suspend caused *ERROR* failed to ena...
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: IGT (show other bugs)
Version: DRI git
Hardware: Other All
: high critical
Assignee: Petri Latvala
QA Contact:
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-01 06:25 UTC by Jani Saarinen
Modified: 2019-11-12 07:37 UTC (History)
3 users (show)

See Also:
i915 platform: KBL
i915 features: power/suspend-resume


Attachments

Description Jani Saarinen 2017-09-01 06:25:44 UTC
On one of the CI systems igt@chamelium@common-hpd-after-suspend caused link error failure.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3023/fi-kbl-7500u/igt@chamelium@common-hpd-after-suspend.html

Out	
rtcwake: wakeup from "mem" using /dev/rtc0 at Thu Aug 31 18:45:56 2017
rtcwake: wakeup from "mem" using /dev/rtc0 at Thu Aug 31 18:46:08 2017
IGT-Version: 1.19-g5ce65a9a (x86_64) (Linux: 4.13.0-rc7-CI-CI_DRM_3023+ x86_64)
Test requirement not met in function require_connector_present, file chamelium.c:90:
Test requirement: found
No port of type HDMI-A was found
Test requirement not met in function require_connector_present, file chamelium.c:90:
Test requirement: found
No port of type VGA was found
Subtest common-hpd-after-suspend: SUCCESS (8.051s)
Environment	
PIGLIT_SOURCE_DIR="/opt/igt/piglit" PIGLIT_PLATFORM="mixed_glx_egl"
Command	
/opt/igt/tests/chamelium --run-subtest common-hpd-after-suspend
Dmesg	
[   47.584421] Suspending console(s) (use no_console_suspend to debug)
[   47.962821] IRQ 136: no longer affine to CPU1
[   47.982329] IRQ 121: no longer affine to CPU2
[   47.982333] IRQ 122: no longer affine to CPU2
[   47.982336] IRQ 124: no longer affine to CPU2
[   47.982345] IRQ 127: no longer affine to CPU2
[   47.982354] IRQ 130: no longer affine to CPU2
[   47.999454] IRQ 8: no longer affine to CPU3
[   47.999464] IRQ 9: no longer affine to CPU3
[   47.999472] IRQ 120: no longer affine to CPU3
[   47.999481] IRQ 125: no longer affine to CPU3
[   48.006538]  cache: parent cpu1 should not be sleeping
[   48.008168]  cache: parent cpu2 should not be sleeping
[   48.009768]  cache: parent cpu3 should not be sleeping
[   48.642335] done.
[   51.980541] Suspending console(s) (use no_console_suspend to debug)
[   52.241578]  cache: parent cpu1 should not be sleeping
[   52.243203]  cache: parent cpu2 should not be sleeping
[   52.244787]  cache: parent cpu3 should not be sleeping
[   54.414773] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[   54.747735] done.

Relation to https://bugs.freedesktop.org/show_bug.cgi?id=102359 but due to different environment (chamelium) not dupping yet...

fi-kbl-7500u	HD Graphics 620 / GT2	HDMI, mDP->Chamelium
Comment 2 Marta Löfstedt 2017-09-22 07:42:19 UTC
Note on 4.14.0-rc1 kernels, i.e. CI_DRM_3099 we are now seeing a lock-dep before the link training issue:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3099/fi-kbl-7500u/igt@chamelium@common-hpd-after-suspend.html
[   70.297898] ======================================================
[   70.297898] WARNING: possible circular locking dependency detected
[   70.297900] 4.14.0-rc1-CI-CI_DRM_3099+ #1 Not tainted
[   70.297901] ------------------------------------------------------
[   70.297901] rtcwake/1459 is trying to acquire lock:
[   70.297902]  ((complete)&st->done){+.+.}, at: [<ffffffff8190987d>] wait_for_completion+0x1d/0x20
[   70.297907] 
               but task is already holding lock:
[   70.297908]  (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[   70.297912] 
               which lock already depends on the new lock.

[   70.297912] 
               the existing dependency chain (in reverse order) is:
[   70.297913] 
               -> #1 (sparse_irq_lock){+.+.}:
[   70.297916]        __mutex_lock+0x86/0x9b0
[   70.297918]        mutex_lock_nested+0x1b/0x20
[   70.297920]        irq_lock_sparse+0x17/0x20
[   70.297921]        irq_affinity_online_cpu+0x18/0xd0
[   70.297923]        cpuhp_invoke_callback+0xa3/0x840
[   70.297924] 
               -> #0 ((complete)&st->done){+.+.}:
[   70.297927]        check_prev_add+0x430/0x840
[   70.297929]        __lock_acquire+0x1420/0x15e0
[   70.297930]        lock_acquire+0xb0/0x200
[   70.297932]        wait_for_common+0x58/0x210
[   70.297933]        wait_for_completion+0x1d/0x20
[   70.297934]        takedown_cpu+0x89/0xf0
[   70.297935]        cpuhp_invoke_callback+0xa3/0x840
[   70.297937]        cpuhp_down_callbacks+0x42/0x80
[   70.297937]        _cpu_down+0xb9/0xf0
[   70.297939]        freeze_secondary_cpus+0xa3/0x390
[   70.297940]        suspend_devices_and_enter+0x2fd/0xce0
[   70.297942]        pm_suspend+0x4f0/0x9d0
[   70.297943]        state_store+0x82/0xf0
[   70.297945]        kobj_attr_store+0xf/0x20
[   70.297947]        sysfs_kf_write+0x45/0x60
[   70.297949]        kernfs_fop_write+0x124/0x1c0
[   70.297950]        __vfs_write+0x28/0x130
[   70.297951]        vfs_write+0xcb/0x1c0
[   70.297953]        SyS_write+0x49/0xb0
[   70.297955]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[   70.297955] 
               other info that might help us debug this:

[   70.297956]  Possible unsafe locking scenario:

[   70.297956]        CPU0                    CPU1
[   70.297956]        ----                    ----
[   70.297957]   lock(sparse_irq_lock);
[   70.297958]                                lock((complete)&st->done);
[   70.297959]                                lock(sparse_irq_lock);
[   70.297960]   lock((complete)&st->done);
[   70.297961] 
                *** DEADLOCK ***

[   70.297963] 8 locks held by rtcwake/1459:
[   70.297963]  #0:  (sb_writers#5){.+.+}, at: [<ffffffff81220161>] vfs_write+0x171/0x1c0
[   70.297966]  #1:  (&of->mutex){+.+.}, at: [<ffffffff812a3302>] kernfs_fop_write+0xf2/0x1c0
[   70.297970]  #2:  (kn->count#189){.+.+}, at: [<ffffffff812a330b>] kernfs_fop_write+0xfb/0x1c0
[   70.297973]  #3:  (pm_mutex){+.+.}, at: [<ffffffff810e5f49>] pm_suspend+0xa9/0x9d0
[   70.297976]  #4:  (acpi_scan_lock){+.+.}, at: [<ffffffff8153b3c7>] acpi_scan_lock_acquire+0x17/0x20
[   70.297980]  #5:  (cpu_add_remove_lock){+.+.}, at: [<ffffffff8108106e>] freeze_secondary_cpus+0x2e/0x390
[   70.297983]  #6:  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff810d660b>] percpu_down_write+0x2b/0x110
[   70.297986]  #7:  (sparse_irq_lock){+.+.}, at: [<ffffffff810f2187>] irq_lock_sparse+0x17/0x20
[   70.297990] 
               stack backtrace:
[   70.297992] CPU: 2 PID: 1459 Comm: rtcwake Not tainted 4.14.0-rc1-CI-CI_DRM_3099+ #1
[   70.297993] Hardware name: GIGABYTE GB-BKi7(H)A-7500/MFLP7AP-00, BIOS F4 02/20/2017
[   70.297993] Call Trace:
[   70.297996]  dump_stack+0x68/0x9f
[   70.297998]  print_circular_bug+0x235/0x3c0
[   70.298000]  ? lockdep_init_map_crosslock+0x20/0x20
[   70.298002]  check_prev_add+0x430/0x840
[   70.298004]  __lock_acquire+0x1420/0x15e0
[   70.298006]  ? __lock_acquire+0x1420/0x15e0
[   70.298008]  ? lockdep_init_map_crosslock+0x20/0x20
[   70.298010]  lock_acquire+0xb0/0x200
[   70.298011]  ? wait_for_completion+0x1d/0x20
[   70.298013]  wait_for_common+0x58/0x210
[   70.298014]  ? wait_for_completion+0x1d/0x20
[   70.298015]  ? cpuhp_invoke_callback+0x840/0x840
[   70.298018]  ? stop_machine_cpuslocked+0xc1/0xd0
[   70.298020]  ? cpuhp_invoke_callback+0x840/0x840
[   70.298021]  wait_for_completion+0x1d/0x20
[   70.298022]  takedown_cpu+0x89/0xf0
[   70.298024]  ? cpuhp_complete_idle_dead+0x20/0x20
[   70.298025]  cpuhp_invoke_callback+0xa3/0x840
[   70.298027]  cpuhp_down_callbacks+0x42/0x80
[   70.298028]  _cpu_down+0xb9/0xf0
[   70.298030]  freeze_secondary_cpus+0xa3/0x390
[   70.298032]  suspend_devices_and_enter+0x2fd/0xce0
[   70.298034]  pm_suspend+0x4f0/0x9d0
[   70.298036]  state_store+0x82/0xf0
[   70.298038]  kobj_attr_store+0xf/0x20
[   70.298040]  sysfs_kf_write+0x45/0x60
[   70.298042]  kernfs_fop_write+0x124/0x1c0
[   70.298043]  __vfs_write+0x28/0x130
[   70.298046]  ? rcu_read_lock_sched_held+0x7a/0x90
[   70.298047]  ? rcu_sync_lockdep_assert+0x2f/0x60
[   70.298049]  ? __sb_start_write+0x108/0x200
[   70.298050]  vfs_write+0xcb/0x1c0
[   70.298052]  SyS_write+0x49/0xb0
[   70.298054]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[   70.298055] RIP: 0033:0x7f134a31a290
[   70.298056] RSP: 002b:00007ffd08e34318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   70.298058] RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007f134a31a290
[   70.298059] RDX: 0000000000000004 RSI: 0000000002141060 RDI: 0000000000000006
[   70.298060] RBP: ffffc9000028ff88 R08: 000000000213edc0 R09: 00007f134a7f6700
[   70.298061] R10: 0000000000000003 R11: 0000000000000246 R12: 000000000213ece0
[   70.298061] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000004
[   70.298065]  ? __this_cpu_preempt_check+0x13/0x20
[   70.312646] IRQ 121: no longer affine to CPU2
[   70.312649] IRQ 122: no longer affine to CPU2
[   70.312651] IRQ 124: no longer affine to CPU2
[   70.312658] IRQ 127: no longer affine to CPU2
[   70.312666] IRQ 130: no longer affine to CPU2
[   70.319706] IRQ 8: no longer affine to CPU3
[   70.319713] IRQ 9: no longer affine to CPU3
[   70.319717] IRQ 120: no longer affine to CPU3
[   70.319723] IRQ 125: no longer affine to CPU3
[   70.325294]  cache: parent cpu1 should not be sleeping
[   70.326538]  cache: parent cpu2 should not be sleeping
[   70.327693]  cache: parent cpu3 should not be sleeping
[   70.453203] HDA: we are doing full chip reset now
[   74.251209] Suspending console(s) (use no_console_suspend to debug)
[   74.474977]  cache: parent cpu1 should not be sleeping
[   74.476247]  cache: parent cpu2 should not be sleeping
[   74.477404]  cache: parent cpu3 should not be sleeping
[   74.601393] HDA: we are doing full chip reset now
[   76.606301] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
Comment 3 Marta Löfstedt 2017-10-02 11:20:43 UTC
Note the lockdep was solved from 4.14.0-rc3
Comment 5 Marta Löfstedt 2017-11-22 06:24:08 UTC
Note from CI_DRM_3369 then chamelium test has changed name to kms_chamelium.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3369/fi-kbl-7500u/igt@kms_chamelium@common-hpd-after-suspend.html
Comment 6 Jani Saarinen 2017-11-24 13:05:36 UTC
For Maarten to investigate
Comment 7 Marta Löfstedt 2018-01-11 13:11:32 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3621/fi-kbl-7500u/igt@kms_chamelium@dp-edid-read.html

[  243.927439] [drm:lspcon_wait_mode [i915]] *ERROR* LSPCON mode hasn't settled
Comment 8 Marta Löfstedt 2018-01-11 13:12:27 UTC
NOTE I have separated this bug from the current KBL-shard Linktraining bug 103558 and LSPCON bug 103313.
Comment 10 Jani Saarinen 2018-03-29 07:10:12 UTC
First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Comment 11 Lakshmi 2018-10-02 09:38:31 UTC
GG, any updates here?
Comment 12 Mun, Gwan-gyeong 2019-03-01 12:08:07 UTC
I got feedback from Imre.  To land  this patch to upstream I have
to add a specific a "Work Around Number" to commit message.
So, I requested adding a "Work Around Number" for this issue to BSpec site.
And I am still waiting for an action of HW Team.
Comment 13 CI Bug Log 2019-04-10 15:07:15 UTC
A CI Bug Log filter associated to this bug has been updated:

{- KBL Chamelium: *ERROR* failed to enable link training -}
{+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@common-hpd-after-suspend.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@dp-hpd-after-suspend.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5891/re-icl-u/igt@kms_chamelium@hdmi-hpd-after-suspend.html
Comment 14 Mun, Gwan-gyeong 2019-04-24 07:03:04 UTC
[code flow of IGT common-hpd-after-suspend]

igt-gpu-tools_FDO/tests/kms_chamelium.c
  common-hpd-after-suspend: test_suspend_resume_hpd_common(..., SUSPEND_STATE_MEM, SUSPEND_TEST_NONE);
  
  1) make all of ports disconnect: reset_state(data, NULL); 
  2) /* Make sure we notice new connectors after resuming */
     try_suspend_resume_hpd(data, NULL, state, test, mon, false);

  3) /* Now make sure we notice disconnected connectors after resuming */
     try_suspend_resume_hpd(data, NULL, state, test, mon, true);


[Connector State]
 1) Make an initial state with Disconnected
    DP: Disconnected
    HDMI: Disconnected

 2) Suspend
    Resume
      [scheduled chamelium connector enable: igt_get_autoresume_delay(state) * 1000 / 2:  15 x 1000 / 2 => 7500ms
      After 7500ms
      DP  : Connected
      HDMI: Connected

    If "Hotplug event" will not come after resuming of PM or if the connector state is not "connected" then IGT will complain "Fail".
      
 3) Suspend
    Resume
      [scheduled chamelium connector disable: igt_get_autoresume_delay(state) * 1000 / 2:  15 x 1000 / 2 => 7500ms
      After 7500ms
      DP  : Disconnected
      HDMI: Disconnected

    If "Hotplug event" will not come after resuming of PM or if the connector state is not "disconnected" then IGT will complain "Fail".
    

-------------------------------------------------------------------------------
[Below shows [IGT or PM or Link Training or ERROR] searched dmesg.]

6> [96.975976] [IGT] kms_chamelium: executing
<6> [97.139345] [IGT] kms_chamelium: starting subtest common-hpd-after-suspend

<6> [98.162805] PM: suspend entry (s2idle)

<7> [109.379231] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training Passed at Link Rate = 162000, Lane count = 4     

<6> [110.511525] PM: suspend exit

<6> [110.702260] PM: suspend entry (s2idle)
   While PM mode is going to handle suspend state, i915 received hpd irq ( dp connector is disconnected )
   Even PM is on handling of suspend mode, i915 is tried to handle hpd event.

<3> [125.489579] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
<7> [125.489625] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training failed at link rate = 162000, lane count = 4

<6> [127.281947] PM: suspend exit
<6> [127.292540] [IGT] kms_chamelium: exiting, ret=0 
<7> [128.294484] [drm:intel_dp_start_link_train [i915]] [CONNECTOR:196:DP-1] Link Training Passed at Link Rate = 162000, Lane count = 4
Comment 15 Jani Saarinen 2019-06-05 16:20:55 UTC
According to Imre. 
"...the same applies here as commented elsewhere: the kms_chamelium@*hpd-after-suspend tests are expected to generate the "failed to enable link training" dmesg error.
Comment 16 CI Bug Log 2019-06-06 13:08:36 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training -}
{+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +}


  No new failures caught with the new filter
Comment 17 Martin Peres 2019-07-01 11:04:56 UTC
Re-opening the bug, but moving it to IGT / Petri as we need the runner to filter out this message!
Comment 18 CI Bug Log 2019-08-08 08:51:26 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training -}
{+ CHAMELIUM: igt@kms_chamelium@* - dmesg-warn - *ERROR* failed to enable link training +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6649/fi-icl-u4/igt@kms_chamelium@hdmi-edid-read.html
Comment 19 ashutosh.dixit 2019-11-01 18:28:30 UTC
Bug assessment: the message continues to happen in almost every run (83% repro rate), but it seems from the comments that there is no but only an "expected" error message. Leaving severity/priority as is in case Petri Latvala gets to it, otherwise Petri please adjust severity/priority. CC'ing.
Comment 20 Martin Peres 2019-11-12 07:37:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/4.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.