Bug 104697 - [CI] igt@perf_pmu@busy-double-start-bcs0 - Incomplete: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [perf_pmu:1609]
Summary: [CI] igt@perf_pmu@busy-double-start-bcs0 - Incomplete: watchdog: BUG: soft lo...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-19 09:12 UTC by Marta Löfstedt
Modified: 2018-02-07 06:48 UTC (History)
1 user (show)

See Also:
i915 platform: BXT
i915 features: Perf/OA


Attachments

Description Marta Löfstedt 2018-01-19 09:12:55 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3645/shard-apl4/igt@perf_pmu@busy-double-start-bcs0.html

From pstore:

<0>[  128.657708] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [perf_pmu:1609]
<4>[  128.657748] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp crct10dif_pclmul i915 crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec lpc_ich snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me mei prime_numbers pinctrl_broxton pinctrl_intel
<4>[  128.657820] irq event stamp: 289026
<4>[  128.657830] hardirqs last  enabled at (289025): [<00000000ff02e880>] restore_regs_and_return_to_kernel+0x0/0x1f
<4>[  128.657835] hardirqs last disabled at (289026): [<00000000a33c5663>] apic_timer_interrupt+0x9a/0xb0
<4>[  128.657839] softirqs last  enabled at (289024): [<00000000ecbdbd47>] __do_softirq+0x3a1/0x4aa
<4>[  128.657845] softirqs last disabled at (289017): [<00000000da21e243>] irq_exit+0xa4/0xb0
<4>[  128.657850] CPU: 1 PID: 1609 Comm: perf_pmu Tainted: G     U           4.15.0-rc8-CI-CI_DRM_3645+ #1
<4>[  128.657853] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  128.657859] RIP: 0010:smp_call_function_single+0x93/0x120
<4>[  128.657862] RSP: 0018:ffffc90000437da0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
<4>[  128.657869] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 0000000000000006
<4>[  128.657872] RDX: 0000000000000001 RSI: ffffffff82108511 RDI: ffffffff820b9866
<4>[  128.657875] RBP: ffffc90000437df0 R08: 0000000000000000 R09: 0000000000000001
<4>[  128.657878] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81160ac0
<4>[  128.657881] R13: ffffc90000437e18 R14: 0000000000000001 R15: 0000000000000005
<4>[  128.657885] FS:  00007ff11e7dfa40(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
<4>[  128.657888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  128.657892] CR2: 00007ff11e8aa000 CR3: 000000026adb0000 CR4: 00000000003406e0
<4>[  128.657895] Call Trace:
<4>[  128.657903]  ? perf_duration_warn+0x40/0x40
<4>[  128.657912]  ? perf_install_in_context+0x132/0x140
<4>[  128.657917]  perf_install_in_context+0x132/0x140
<4>[  128.657922]  ? __perf_event_enable+0x1c0/0x1c0
<4>[  128.657930]  SyS_perf_event_open+0x779/0x11d0
<4>[  128.657952]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[  128.657956] RIP: 0033:0x7ff11c9e29f9
<4>[  128.657959] RSP: 002b:00007ffd261d6778 EFLAGS: 00000246
<4>[  128.657975] Code: 01 73 07 9c 58 f6 c4 02 74 53 45 85 f6 74 5f 48 8d 75 b0 89 df 4c 89 e9 4c 89 e2 e8 98 fe ff ff 89 c3 8b 45 c8 a8 01 74 0a f3 90 <8b> 55 c8 83 e2 01 75 f6 bf 01 00 00 00 e8 db c3 f8 ff 65 8b 05 
<0>[  128.658157] Kernel panic - not syncing: softlockup: hung tasks
<4>[  128.658185] CPU: 1 PID: 1609 Comm: perf_pmu Tainted: G     U       L   4.15.0-rc8-CI-CI_DRM_3645+ #1
<4>[  128.658222] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  128.658261] Call Trace:
<4>[  128.658275]  <IRQ>
<4>[  128.658288]  dump_stack+0x5f/0x86
<4>[  128.658307]  panic+0xcf/0x20d
<4>[  128.658333]  watchdog_timer_fn+0x278/0x280
<4>[  128.658356]  ? watchdog+0x30/0x30
<4>[  128.658374]  __hrtimer_run_queues+0xdf/0x4d0
<4>[  128.658401]  hrtimer_interrupt+0xb6/0x220
<4>[  128.658424]  ? perf_duration_warn+0x40/0x40
<4>[  128.658446]  smp_apic_timer_interrupt+0x6f/0x2d0
<4>[  128.658469]  apic_timer_interrupt+0x9f/0xb0
<4>[  128.658490]  </IRQ>
<4>[  128.658503] RIP: 0010:smp_call_function_single+0x93/0x120
<4>[  128.658526] RSP: 0018:ffffc90000437da0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
<4>[  128.658562] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 0000000000000006
<4>[  128.658592] RDX: 0000000000000001 RSI: ffffffff82108511 RDI: ffffffff820b9866
<4>[  128.658622] RBP: ffffc90000437df0 R08: 0000000000000000 R09: 0000000000000001
<4>[  128.658652] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81160ac0
<4>[  128.658682] R13: ffffc90000437e18 R14: 0000000000000001 R15: 0000000000000005
<4>[  128.658714]  ? perf_duration_warn+0x40/0x40
<4>[  128.658744]  ? perf_duration_warn+0x40/0x40
<4>[  128.658770]  ? perf_install_in_context+0x132/0x140
<4>[  128.658793]  perf_install_in_context+0x132/0x140
<4>[  128.658816]  ? __perf_event_enable+0x1c0/0x1c0
<4>[  128.658841]  SyS_perf_event_open+0x779/0x11d0
<4>[  128.658879]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[  128.658901] RIP: 0033:0x7ff11c9e29f9
<4>[  128.658918] RSP: 002b:00007ffd261d6778 EFLAGS: 00000246
<0>[  129.779249] Shutting down cpus with NMI
<0>[  129.779300] Dumping ftrace buffer:
<0>[  129.779446] ---------------------------------
<0>[  129.779557] modprobe-276     2..s1 4830458us : execlists_submission_tasklet: rcs0 in[0]:  ctx=2.1, seqno=1
Comment 1 Marta Löfstedt 2018-01-19 09:17:21 UTC
This looks related:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4154/shard-apl3/igt@perf_pmu@busy-double-start-bcs0.html

from pstore:
<0>[  124.781813] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [thermald:578]
Comment 2 Chris Wilson 2018-02-06 14:33:14 UTC
commit b2f78cda260bc6a1a2d382b1d85a29e69b5b3724
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Feb 5 09:34:48 2018 +0000

    drm/i915/pmu: Fix PMU enable vs execlists tasklet race
    
    Commit 99e48bf98dd0 ("drm/i915: Lock out execlist tasklet while peeking
    inside for busy-stats") added a tasklet_disable call in busy stats
    enabling, but we failed to understand that the PMU enable callback runs
    as an hard IRQ (IPI).
    
    Consequence of this is that the PMU enable callback can interrupt the
    execlists tasklet, and will then deadlock when it calls
    intel_engine_stats_enable->tasklet_disable.
    
    To fix this, I realized it is possible to move the engine stats enablement
    and disablement to PMU event init and destroy hooks. This allows for much
    simpler implementation since those hooks run in normal context (can
    sleep).
    
    v2: Extract engine_event_destroy. (Chris Wilson)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Fixes: 99e48bf98dd0 ("drm/i915: Lock out execlist tasklet while peeking inside for busy-stats")
    Testcase: igt/perf_pmu/enable-race-*
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: intel-gfx@lists.freedesktop.org
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180205093448.13877-1-tvrtko.ursulin@linux.intel.com
Comment 3 Marta Löfstedt 2018-02-07 06:48:34 UTC
Fix integrated to CI_DRM_3728.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.