Bug 104943

Summary: [CI] igt@perf_pmu@rc6 - dmesg-warn - BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1034 -> BUG: scheduling while atomic: perf_pmu/1513/0x00000004 -> DEBUG_LOCKS_WARN_ON(val > preempt_count())
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Marta Löfstedt <marta.lofstedt>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: BXT i915 features: Perf/OA

Description Marta Löfstedt 2018-02-05 07:10:31 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4212/shard-apl4/igt@perf_pmu@rc6.html

<3>[  101.309358] BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1034
<3>[  101.309502] in_atomic(): 1, irqs_disabled(): 1, pid: 1513, name: perf_pmu
<4>[  101.309558] 3 locks held by perf_pmu/1513:
<4>[  101.309563]  #0:  (&cpuctx_mutex){+.+.}, at: [<00000000c3ec2708>] perf_event_ctx_lock_nested+0xbc/0x1d0
<4>[  101.309598]  #1:  (&event->child_mutex){+.+.}, at: [<000000004cd02bc9>] __perf_event_read_value+0x32/0xf0
<4>[  101.309624]  #2:  (&cpuctx_lock){....}, at: [<000000002343d2dc>] __perf_event_read+0x64/0x2e0
<4>[  101.309653] irq event stamp: 25750
<4>[  101.309662] hardirqs last  enabled at (25749): [<00000000c4700020>] entry_SYSCALL_64_fastpath+0x5/0x8f
<4>[  101.309670] hardirqs last disabled at (25750): [<00000000e34beea3>] generic_exec_single+0x90/0xe0
<4>[  101.309678] softirqs last  enabled at (25466): [<00000000bebe2149>] __do_softirq+0x3a1/0x4aa
<4>[  101.309687] softirqs last disabled at (25443): [<00000000062fd6b7>] irq_exit+0xa4/0xb0
<3>[  101.309692] Preemption disabled at:
<4>[  101.309699] [<000000005125cef8>] perf_event_read+0x98/0x2e0
<4>[  101.309741] CPU: 0 PID: 1513 Comm: perf_pmu Tainted: G     U           4.15.0-CI-CI_DRM_3717+ #1
<4>[  101.309747] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  101.309752] Call Trace:
<4>[  101.309764]  dump_stack+0x5f/0x86
<4>[  101.309774]  ___might_sleep+0x1d9/0x240
<4>[  101.309788]  __pm_runtime_resume+0x75/0x80
<4>[  101.309870]  intel_runtime_pm_get+0x1c/0xa0 [i915]
<4>[  101.309938]  __i915_pmu_event_read+0x54/0x1c0 [i915]
<4>[  101.310008]  i915_pmu_event_read+0x1e/0x40 [i915]
<4>[  101.310019]  __perf_event_read+0x2b8/0x2e0
<4>[  101.310031]  ? __perf_event_output_stop+0xa0/0xa0
<4>[  101.310040]  generic_exec_single+0x96/0xe0
<4>[  101.310050]  ? __perf_event_output_stop+0xa0/0xa0
<4>[  101.310058]  smp_call_function_single+0x88/0x120
<4>[  101.310076]  ? perf_event_read+0xed/0x2e0
<4>[  101.310083]  ? preempt_count_add+0x44/0x90
<4>[  101.310090]  perf_event_read+0xed/0x2e0
<4>[  101.310106]  __perf_event_read_value+0x3c/0xf0
<4>[  101.310119]  perf_read+0x15e/0x290
<4>[  101.310130]  ? lock_acquire+0xaf/0x200
<4>[  101.310143]  ? free_object+0x6e/0x90
<4>[  101.310156]  __vfs_read+0x1e/0x120
<4>[  101.310166]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4>[  101.310174]  ? free_object+0x6e/0x90
<4>[  101.310184]  ? entry_SYSCALL_64_fastpath+0x5/0x8f
<4>[  101.310195]  vfs_read+0x9e/0x150
<4>[  101.310206]  SyS_read+0x40/0xa0
<4>[  101.310219]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[  101.310226] RIP: 0033:0x7f361c4276d0
<4>[  101.310232] RSP: 002b:00007ffcce301ee8 EFLAGS: 00000246
<3>[  101.311022] BUG: scheduling while atomic: perf_pmu/1513/0x00000004
<4>[  101.311082] 3 locks held by perf_pmu/1513:
<4>[  101.311088]  #0:  (&cpuctx_mutex){+.+.}, at: [<00000000c3ec2708>] perf_event_ctx_lock_nested+0xbc/0x1d0
<4>[  101.311124]  #1:  (&event->child_mutex){+.+.}, at: [<000000004cd02bc9>] __perf_event_read_value+0x32/0xf0
<4>[  101.311152]  #2:  (&cpuctx_lock){+.+.}, at: [<000000002343d2dc>] __perf_event_read+0x64/0x2e0
<4>[  101.311180] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915 lpc_ich r8169 mii snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_broxton pinctrl_intel mei_me prime_numbers mei
<3>[  101.311314] Preemption disabled at:
<4>[  101.311321] [<000000005125cef8>] perf_event_read+0x98/0x2e0
<4>[  101.311362] CPU: 0 PID: 1513 Comm: perf_pmu Tainted: G     U  W        4.15.0-CI-CI_DRM_3717+ #1
<4>[  101.311368] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  101.311374] Call Trace:
<4>[  101.311385]  dump_stack+0x5f/0x86
<4>[  101.311394]  ? perf_event_read+0x98/0x2e0
<4>[  101.311402]  __schedule_bug+0x7a/0xd0
<4>[  101.311413]  __schedule+0x84c/0xaf0
<4>[  101.311424]  ? mark_held_locks+0x64/0x90
<4>[  101.311432]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4>[  101.311446]  schedule+0x37/0x90
<4>[  101.311456]  schedule_timeout+0x1da/0x520
<4>[  101.311471]  ? collect_expired_timers+0xa0/0xa0
<4>[  101.311485]  ? msleep+0x1d/0x40
<4>[  101.311497]  ? pci_restore_standard_config+0x40/0x40
<4>[  101.311504]  msleep+0x34/0x40
<4>[  101.311514]  pci_raw_set_power_state+0x18f/0x220
<4>[  101.311526]  pci_set_power_state+0x94/0x110
<4>[  101.311537]  pci_restore_standard_config+0x22/0x40
<4>[  101.311545]  pci_pm_runtime_resume+0x3d/0x90
<4>[  101.311556]  __rpm_callback+0xb1/0x1e0
<4>[  101.311566]  ? pci_restore_standard_config+0x40/0x40
<4>[  101.311577]  rpm_callback+0x1a/0x70
<4>[  101.311586]  ? pci_restore_standard_config+0x40/0x40
<4>[  101.311594]  rpm_resume+0x4d7/0x770
<4>[  101.311615]  __pm_runtime_resume+0x42/0x80
<4>[  101.311700]  intel_runtime_pm_get+0x1c/0xa0 [i915]
<4>[  101.311769]  __i915_pmu_event_read+0x54/0x1c0 [i915]
<4>[  101.311840]  i915_pmu_event_read+0x1e/0x40 [i915]
<4>[  101.311852]  __perf_event_read+0x2b8/0x2e0
<4>[  101.311864]  ? __perf_event_output_stop+0xa0/0xa0
<4>[  101.311874]  generic_exec_single+0x96/0xe0
<4>[  101.311883]  ? __perf_event_output_stop+0xa0/0xa0
<4>[  101.311891]  smp_call_function_single+0x88/0x120
<4>[  101.311910]  ? perf_event_read+0xed/0x2e0
<4>[  101.311916]  ? preempt_count_add+0x44/0x90
<4>[  101.311924]  perf_event_read+0xed/0x2e0
<4>[  101.311940]  __perf_event_read_value+0x3c/0xf0
<4>[  101.311954]  perf_read+0x15e/0x290
<4>[  101.311963]  ? lock_acquire+0xaf/0x200
<4>[  101.311975]  ? free_object+0x6e/0x90
<4>[  101.311989]  __vfs_read+0x1e/0x120
<4>[  101.311999]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4>[  101.312007]  ? free_object+0x6e/0x90
<4>[  101.312018]  ? entry_SYSCALL_64_fastpath+0x5/0x8f
<4>[  101.312028]  vfs_read+0x9e/0x150
<4>[  101.312040]  SyS_read+0x40/0xa0
<4>[  101.312054]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[  101.312060] RIP: 0033:0x7f361c4276d0
<4>[  101.312066] RSP: 002b:00007ffcce301ee8 EFLAGS: 00000246
<7>[  101.322131] [drm:intel_runtime_resume [i915]] Resuming device
<7>[  101.322196] [drm:bxt_disable_dc9 [i915]] Disabling DC9
<7>[  101.322243] [drm:gen9_set_dc_state [i915]] Setting DC state from 08 to 00
<7>[  101.322303] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 00
<7>[  101.322433] [drm:intel_power_well_enable [i915]] enabling power well 1
<7>[  101.322587] [drm:intel_dump_cdclk_state [i915]] Current CDCLK 19200 kHz, VCO 0 kHz, ref 19200 kHz, bypass 19200 kHz, voltage level 1
<7>[  101.322636] [drm:bxt_init_cdclk [i915]] Sanitizing cdclk programmed by pre-os
<7>[  101.323942] [drm:gen9_enable_dc5 [i915]] Enabling DC5
<7>[  101.323990] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 01
<7>[  101.325816] [drm:__bxt_hpd_detection_setup [i915]] Invert bit setting: hp_ctl:10001818 hp_port:38
<7>[  101.326971] [drm:intel_dp_detect [i915]] [CONNECTOR:65:DP-1]
<7>[  101.327365] [drm:__bxt_hpd_detection_setup [i915]] Invert bit setting: hp_ctl:10001818 hp_port:30
<7>[  101.327980] [drm:intel_runtime_resume [i915]] Device resumed
<7>[  101.328391] [drm:intel_power_well_enable [i915]] enabling always-on
<7>[  101.328437] [drm:intel_power_well_enable [i915]] enabling DC off
<7>[  101.328733] [drm:gen9_set_dc_state [i915]] Setting DC state from 01 to 00
<7>[  101.328794] [drm:intel_power_well_enable [i915]] enabling power well 2
<7>[  101.328931] [drm:intel_power_well_enable [i915]] enabling dpio-common-bc
<4>[  101.329049] ------------[ cut here ]------------
<4>[  101.329065] DEBUG_LOCKS_WARN_ON(val > preempt_count())
<4>[  101.329076] WARNING: CPU: 0 PID: 1513 at kernel/sched/core.c:3122 preempt_count_sub+0x55/0x90
<4>[  101.329103] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915 lpc_ich r8169 mii snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_broxton pinctrl_intel mei_me prime_numbers mei
<4>[  101.329189] CPU: 0 PID: 1513 Comm: perf_pmu Tainted: G     U  W        4.15.0-CI-CI_DRM_3717+ #1
<4>[  101.329207] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  101.329226] RIP: 0010:preempt_count_sub+0x55/0x90
<4>[  101.329237] RSP: 0018:ffffc900003ffcc0 EFLAGS: 00010082
<4>[  101.329250] RAX: 000000000000002a RBX: ffffe8ffffc01dd0 RCX: 0000000000000001
<4>[  101.329264] RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffffff810eab88
<4>[  101.329279] RBP: ffffc900003ffd00 R08: 0000000000000000 R09: 0000000000000000
<4>[  101.329294] R10: 0000000000000000 R11: ffffffff810eab54 R12: ffffc900003ffd80
<4>[  101.329308] R13: ffffffff8115ce60 R14: 0000000000000001 R15: ffffc900003ffe10
<4>[  101.329323] FS:  00007f361df4fa40(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
<4>[  101.329340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  101.329353] CR2: 00007f31efcea848 CR3: 000000026cc8a000 CR4: 00000000003406f0
<4>[  101.329367] Call Trace:
<4>[  101.329379]  _raw_spin_unlock+0x29/0x40
<4>[  101.329391]  generic_exec_single+0x96/0xe0
<4>[  101.329405]  ? __perf_event_output_stop+0xa0/0xa0
<4>[  101.329417]  smp_call_function_single+0x88/0x120
<4>[  101.329432]  ? perf_event_read+0xed/0x2e0
<4>[  101.329442]  ? preempt_count_add+0x44/0x90
<4>[  101.329452]  perf_event_read+0xed/0x2e0
<4>[  101.329466]  __perf_event_read_value+0x3c/0xf0
<4>[  101.329479]  perf_read+0x15e/0x290
<4>[  101.329490]  ? lock_acquire+0xaf/0x200
<4>[  101.329503]  ? free_object+0x6e/0x90
<4>[  101.329515]  __vfs_read+0x1e/0x120
<4>[  101.329525]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4>[  101.329537]  ? free_object+0x6e/0x90
<4>[  101.329548]  ? entry_SYSCALL_64_fastpath+0x5/0x8f
<4>[  101.329560]  vfs_read+0x9e/0x150
<4>[  101.329571]  SyS_read+0x40/0xa0
<4>[  101.329582]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[  101.329593] RIP: 0033:0x7f361c4276d0
<4>[  101.329602] RSP: 002b:00007ffcce301ee8 EFLAGS: 00000246
<4>[  101.329610] Code: 62 d5 f6 7e c3 e8 3c 59 3a 00 85 c0 74 20 8b 15 5a 57 62 02 85 d2 75 16 48 c7 c6 d0 4a 07 82 48 c7 c7 2b 5f 06 82 e8 ab 4c fd ff <0f> ff c3 f3 c3 84 d2 75 c7 e8 0d 59 3a 00 85 c0 74 f1 8b 05 2b 
<4>[  101.329744] ---[ end trace ec00e489ea328d9c ]---
Comment 1 Chris Wilson 2018-02-06 14:31:56 UTC
commit b2f78cda260bc6a1a2d382b1d85a29e69b5b3724
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Feb 5 09:34:48 2018 +0000

    drm/i915/pmu: Fix PMU enable vs execlists tasklet race
    
    Commit 99e48bf98dd0 ("drm/i915: Lock out execlist tasklet while peeking
    inside for busy-stats") added a tasklet_disable call in busy stats
    enabling, but we failed to understand that the PMU enable callback runs
    as an hard IRQ (IPI).
    
    Consequence of this is that the PMU enable callback can interrupt the
    execlists tasklet, and will then deadlock when it calls
    intel_engine_stats_enable->tasklet_disable.
    
    To fix this, I realized it is possible to move the engine stats enablement
    and disablement to PMU event init and destroy hooks. This allows for much
    simpler implementation since those hooks run in normal context (can
    sleep).
    
    v2: Extract engine_event_destroy. (Chris Wilson)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Fixes: 99e48bf98dd0 ("drm/i915: Lock out execlist tasklet while peeking inside for busy-stats")
    Testcase: igt/perf_pmu/enable-race-*
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: intel-gfx@lists.freedesktop.org
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180205093448.13877-1-tvrtko.ursulin@linux.intel.com
Comment 2 Chris Wilson 2018-02-06 14:32:17 UTC
Wrong bug, sorry.
Comment 3 Chris Wilson 2018-02-07 13:49:42 UTC
commit 1fe699e30113ed6f6e853ff44710d256072ea627
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Tue Feb 6 18:33:11 2018 +0000

    drm/i915/pmu: Fix sleep under atomic in RC6 readout
    
    We are not allowed to call intel_runtime_pm_get from the PMU counter read
    callback since the former can sleep, and the latter is running under IRQ
    context.
    
    To workaround this, we record the last known RC6 and while runtime
    suspended estimate its increase by querying the runtime PM core
    timestamps.
    
    Downside of this approach is that we can temporarily lose a chunk of RC6
    time, from the last PMU read-out to runtime suspend entry, but that will
    eventually catch up, once device comes back online and in the presence of
    PMU queries.
    
    Also, we have to be careful not to overshoot the RC6 estimate, so once
    resumed after a period of approximation, we only update the counter once
    it catches up. With the observation that RC6 is increasing while the
    device is suspended, this should not pose a problem and can only cause
    slight inaccuracies due clock base differences.
    
    v2: Simplify by estimating on top of PM core counters. (Imre)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104943
    Fixes: 6060b6aec03c ("drm/i915/pmu: Add RC6 residency metrics")
    Testcase: igt/perf_pmu/rc6-runtime-pm
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Imre Deak <imre.deak@intel.com>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: David Airlie <airlied@linux.ie>
    Cc: intel-gfx@lists.freedesktop.org
    Cc: dri-devel@lists.freedesktop.org
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180206183311.17924-1-tvrtko.ursulin@linux.intel.com
Comment 4 Marta Löfstedt 2018-02-07 14:00:13 UTC
Fix included in CI_DRM_3735 I will monitor and close if applicable.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.