Bug 100419

Summary: [BDW][EXT] igt@kms_flip@vblank-vs-dpms-suspend-interruptible hard LOCKUP on cpu 3
Product: DRI Reporter: Tomi Sarvela <tomi.p.sarvela>
Component: DRM/IntelAssignee: Libin Yang <libin.yang>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: highest CC: intel-gfx-bugs, mengdong.lin, vinod.koul
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: PatchMerged
i915 platform: BDW i915 features: power/runtime PM
Attachments:
Description Flags
patch none

Description Tomi Sarvela 2017-03-27 15:12:22 UTC
Todays DRM-Tip, kconfig close to
https://intel-gfx-ci.01.org/CI/CI_DRM_2413/kernel.config.bz2

Running IGT extended list, BDW NUC panicked in hard lockup.

Trace catched from pstore:

<7>[ 1193.617247] [drm:intel_power_well_disable [i915]] disabling display
<7>[ 1193.617312] [drm:hsw_set_power_well [i915]] Requesting to disable the power well
<7>[ 1193.617354] [drm:verify_connector_state.isra.48 [i915]] [CONNECTOR:53:DP-1]
<7>[ 1193.617387] [drm:intel_atomic_commit_tail [i915]] [CRTC:39:pipe B]
<7>[ 1193.617420] [drm:verify_single_dpll_state.isra.73 [i915]] LCPLL 810
<7>[ 1193.617450] [drm:intel_power_well_disable [i915]] disabling always-on
<6>[ 1193.672616] PM: Syncing filesystems ... done.
<7>[ 1193.673139] PM: Preparing system for sleep (mem)
<6>[ 1193.673599] Freezing user space processes ... (elapsed 0.001 seconds) done.
<6>[ 1193.675270] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
<7>[ 1193.676773] PM: Suspending system (mem)
<4>[ 1193.676914] Suspending console(s) (use no_console_suspend to debug)
<5>[ 1193.679046] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<5>[ 1193.679140] sd 0:0:0:0: [sda] Stopping disk
<6>[ 1193.680224] e1000e: EEE TX LPI TIMER: 00000011
<7>[ 1193.680290] [drm:intel_power_well_enable [i915]] enabling always-on
<7>[ 1193.680320] [drm:intel_power_well_enable [i915]] enabling display
<7>[ 1193.680349] [drm:hsw_set_power_well [i915]] Enabling power well
<7>[ 1193.680735] [drm:intel_atomic_check [i915]] New cdclk calculated to be logical 337500 kHz, actual 337500 kHz
<7>[ 1193.680867] [drm:intel_atomic_commit_tail [i915]] [ENCODER:47:DDI B]
<7>[ 1193.680899] [drm:intel_atomic_commit_tail [i915]] [ENCODER:52:DDI C]
<7>[ 1193.680934] [drm:intel_atomic_commit_tail [i915]] [ENCODER:54:DP-MST A]
<7>[ 1193.680963] [drm:intel_atomic_commit_tail [i915]] [ENCODER:55:DP-MST B]
<7>[ 1193.680992] [drm:intel_atomic_commit_tail [i915]] [ENCODER:56:DP-MST C]
<7>[ 1193.681021] [drm:verify_connector_state.isra.48 [i915]] [CONNECTOR:53:DP-1]
<7>[ 1193.681057] [drm:verify_single_dpll_state.isra.73 [i915]] WRPLL 1
<7>[ 1193.681090] [drm:verify_single_dpll_state.isra.73 [i915]] WRPLL 2
<7>[ 1193.681121] [drm:verify_single_dpll_state.isra.73 [i915]] SPLL
<7>[ 1193.681152] [drm:verify_single_dpll_state.isra.73 [i915]] LCPLL 810
<7>[ 1193.681179] [drm:verify_single_dpll_state.isra.73 [i915]] LCPLL 1350
<7>[ 1193.681206] [drm:verify_single_dpll_state.isra.73 [i915]] LCPLL 2700
<7>[ 1193.681243] [drm:intel_atomic_commit_tail [i915]] [CRTC:39:pipe B]
<0>[ 1211.603046] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
<4>[ 1211.603047] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek snd_hda_codec_generic ghash_clmulni_intel e1000e snd_hda_codec snd_hwdep snd_hda_core snd_pcm ptp mei_me prime_numbers pps_core mei lpc_ich i2c_hid i2c_designware_platform i2c_designware_core [last unloaded: i915]
<4>[ 1211.603053] irq event stamp: 13366
<4>[ 1211.603053] hardirqs last  enabled at (13365): [<ffffffff81891a87>] _raw_spin_unlock_irq+0x27/0x50
<4>[ 1211.603053] hardirqs last disabled at (13366): [<ffffffff818918d2>] _raw_spin_lock_irq+0x12/0x50
<4>[ 1211.603054] softirqs last  enabled at (12744): [<ffffffff81085c79>] __do_softirq+0x1d9/0x4c0
<4>[ 1211.603054] softirqs last disabled at (12721): [<ffffffff810860d9>] irq_exit+0xa9/0xc0
<4>[ 1211.603054] CPU: 3 PID: 10443 Comm: kworker/u8:11 Tainted: G     U          4.11.0-rc4-CI-CI_DRM_319+ #1
<4>[ 1211.603055] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0362.2017.0118.0940 01/18/2017
<4>[ 1211.603055] Workqueue: events_unbound async_run_entry_fn
<4>[ 1211.603056] task: ffff88024cd32740 task.stack: ffffc9000162c000
<4>[ 1211.603056] RIP: 0010:preempt_count_add+0xe/0xc0
<4>[ 1211.603056] RSP: 0018:ffffc9000162fbd8 EFLAGS: 00000082
<4>[ 1211.603057] RAX: 0000000080000001 RBX: 0000000704b96558 RCX: 0000000000000002
<4>[ 1211.603057] RDX: 0000000000000000 RSI: ffffffff81c74f2d RDI: 0000000000000001
<4>[ 1211.603057] RBP: ffffc9000162fc08 R08: 00000000bbcc90cc R09: 23c7b07100000000
<4>[ 1211.603058] R10: ffffffff827901a8 R11: ffff88024cd32740 R12: 0000000704b92baa
<4>[ 1211.603058] R13: 0000000000003ea0 R14: 0000000000000003 R15: ffffffffa00061f0
<4>[ 1211.603058] FS:  0000000000000000(0000) GS:ffff880256d80000(0000) knlGS:0000000000000000
<4>[ 1211.603058] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1211.603059] CR2: 00007f90f84a5144 CR3: 0000000003e0f000 CR4: 00000000003406e0
<4>[ 1211.603059] Call Trace:
<4>[ 1211.603059]  ? delay_tsc+0x3d/0xc0
<4>[ 1211.603059]  __delay+0xa/0x10
<4>[ 1211.603060]  __const_udelay+0x31/0x40
<4>[ 1211.603060]  snd_hdac_bus_stop_cmd_io+0x96/0xe0 [snd_hda_core]
<4>[ 1211.603060]  ? azx_dev_disconnect+0x20/0x20 [snd_hda_intel]
<4>[ 1211.603061]  snd_hdac_bus_stop_chip+0xb1/0x100 [snd_hda_core]
<4>[ 1211.603061]  azx_stop_chip+0x9/0x10 [snd_hda_codec]
<4>[ 1211.603061]  azx_suspend+0x72/0x220 [snd_hda_intel]
<4>[ 1211.603061]  pci_pm_suspend+0x71/0x140
<4>[ 1211.603062]  dpm_run_callback+0x6f/0x330
<4>[ 1211.603062]  ? pci_pm_freeze+0xe0/0xe0
<4>[ 1211.603062]  __device_suspend+0xf9/0x370
<4>[ 1211.603062]  ? dpm_watchdog_set+0x60/0x60
<4>[ 1211.603063]  async_suspend+0x1a/0x90
<4>[ 1211.603063]  async_run_entry_fn+0x34/0x160
<4>[ 1211.603063]  process_one_work+0x1f4/0x6d0
<4>[ 1211.603063]  ? process_one_work+0x16e/0x6d0
<4>[ 1211.603064]  worker_thread+0x49/0x4a0
<4>[ 1211.603064]  kthread+0x107/0x140
<4>[ 1211.603064]  ? process_one_work+0x6d0/0x6d0
<4>[ 1211.603065]  ? kthread_create_on_node+0x40/0x40
<4>[ 1211.603065]  ret_from_fork+0x2e/0x40
<4>[ 1211.603065] Code: 00 48 c7 c7 8d 5d c4 81 c6 05 c3 6e db 00 01 e8 89 2e fd ff eb 96 0f 1f 80 00 00 00 00 8b 15 f2 b2 c5 01 55 65 01 3d 62 08 f6 7e <48> 89 e5 53 89 fb 85 d2 75 11 65 8b 05 51 08 f6 7e 0f b6 c0 3d 
<0>[ 1211.603074] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1211.603074] CPU: 3 PID: 10443 Comm: kworker/u8:11 Tainted: G     U          4.11.0-rc4-CI-CI_DRM_319+ #1
<4>[ 1211.603074] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0362.2017.0118.0940 01/18/2017
<4>[ 1211.603075] Workqueue: events_unbound async_run_entry_fn
<4>[ 1211.603075] Call Trace:
<4>[ 1211.603075]  <NMI>
<4>[ 1211.603076]  dump_stack+0x67/0x92
<4>[ 1211.603076]  panic+0xcf/0x205
<4>[ 1211.603076]  nmi_panic+0x34/0x40
<4>[ 1211.603076]  watchdog_overflow_callback+0xe8/0x130
<4>[ 1211.603077]  __perf_event_overflow+0x5d/0xf0
<4>[ 1211.603077]  perf_event_overflow+0x14/0x20
<4>[ 1211.603077]  intel_pmu_handle_irq+0x1f7/0x580
<4>[ 1211.603077]  perf_event_nmi_handler+0x28/0x50
<4>[ 1211.603078]  ? preempt_count_add+0xe/0xc0
<4>[ 1211.603078]  nmi_handle+0x91/0x2d0
<4>[ 1211.603078]  ? trace_raw_output_nmi_handler+0x70/0x70
<4>[ 1211.603078]  ? preempt_count_add+0xe/0xc0
<4>[ 1211.603079]  ? pci_azx_readw+0x10/0x10 [snd_hda_intel]
<4>[ 1211.603079]  default_do_nmi+0x127/0x1a0
<4>[ 1211.603079]  do_nmi+0x10c/0x150
<4>[ 1211.603079]  end_repeat_nmi+0x1a/0x1e
<4>[ 1211.603080] RIP: 0010:preempt_count_add+0xe/0xc0
<4>[ 1211.603080] RSP: 0018:ffffc9000162fbd8 EFLAGS: 00000082
<4>[ 1211.603080] RAX: 0000000080000001 RBX: 0000000704b96558 RCX: 0000000000000002
<4>[ 1211.603081] RDX: 0000000000000000 RSI: ffffffff81c74f2d RDI: 0000000000000001
<4>[ 1211.603081] RBP: ffffc9000162fc08 R08: 00000000bbcc90cc R09: 23c7b07100000000
<4>[ 1211.603081] R10: ffffffff827901a8 R11: ffff88024cd32740 R12: 0000000704b92baa
<4>[ 1211.603082] R13: 0000000000003ea0 R14: 0000000000000003 R15: ffffffffa00061f0
<4>[ 1211.603082]  ? pci_azx_readw+0x10/0x10 [snd_hda_intel]
<4>[ 1211.603082]  ? preempt_count_add+0xe/0xc0
<4>[ 1211.603082]  ? preempt_count_add+0xe/0xc0
<4>[ 1211.603083]  </NMI>
<4>[ 1211.603083]  ? delay_tsc+0x3d/0xc0
<4>[ 1211.603083]  __delay+0xa/0x10
<4>[ 1211.603083]  __const_udelay+0x31/0x40
<4>[ 1211.603084]  snd_hdac_bus_stop_cmd_io+0x96/0xe0 [snd_hda_core]
<4>[ 1211.603084]  ? azx_dev_disconnect+0x20/0x20 [snd_hda_intel]
<4>[ 1211.603084]  snd_hdac_bus_stop_chip+0xb1/0x100 [snd_hda_core]
<4>[ 1211.603084]  azx_stop_chip+0x9/0x10 [snd_hda_codec]
<4>[ 1211.603085]  azx_suspend+0x72/0x220 [snd_hda_intel]
<4>[ 1211.603085]  pci_pm_suspend+0x71/0x140
<4>[ 1211.603085]  dpm_run_callback+0x6f/0x330
<4>[ 1211.603085]  ? pci_pm_freeze+0xe0/0xe0
<4>[ 1211.603086]  __device_suspend+0xf9/0x370
<4>[ 1211.603086]  ? dpm_watchdog_set+0x60/0x60
<4>[ 1211.603086]  async_suspend+0x1a/0x90
<4>[ 1211.603086]  async_run_entry_fn+0x34/0x160
<4>[ 1211.603087]  process_one_work+0x1f4/0x6d0
<4>[ 1211.603087]  ? process_one_work+0x16e/0x6d0
<4>[ 1211.603087]  worker_thread+0x49/0x4a0
<4>[ 1211.603087]  kthread+0x107/0x140
<4>[ 1211.603088]  ? process_one_work+0x6d0/0x6d0
<4>[ 1211.603088]  ? kthread_create_on_node+0x40/0x40
<4>[ 1211.603088]  ret_from_fork+0x2e/0x40
<0>[ 1211.603112] Kernel Offset: disabled
Comment 1 Chris Wilson 2017-03-27 16:22:31 UTC
*** Bug 100420 has been marked as a duplicate of this bug. ***
Comment 2 Marta Löfstedt 2017-03-28 08:10:21 UTC
This I have duplicated on BDW NUC and am currently discussing with
Comment 3 Marta Löfstedt 2017-03-28 08:20:15 UTC
I assigned this to Libin, Yang since we have already disucced this issue in Jira.
Comment 4 Jani Saarinen 2017-03-28 09:17:29 UTC
Raising priority.
Comment 6 Marta Löfstedt 2017-03-29 07:00:49 UTC
Here is info on the HW: https://intel-gfx-ci.01.org/CI/hardware.html

On the farm this issue is reproducible on Intel NUC5I7RYB. I can also reproduce on NUCi5RYH. 

The drm-tip kernel is here:
https://cgit.freedesktop.org/drm-tip
This issue typically cause suspend-resume tests to not terminate. As we don't have serial connectors on most of the machines in our farms, we use pstore to capture logs that will be available on next reboot. So, what you see in below stacktrace is:
system is going to suspend:
<7>[ 1193.676773] PM: Suspending system (mem)
nmi watchdog discovers lockup:
<0>[ 1211.603046] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
This will generate a panic which will trigger pstore to store the logs.
Then the system will reboot. After reboot the pstore captured logs will be available in:
/sys/fs/pstore
Note, the pstore logs should be cleaned up between runs.

For this we have added kernel configs:

CONFIG_EFI_VARS_PSTORE=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1

and kernel boot params:
nmi_watchdog=panic,auto panic=1


To reproduce I suggest setting up the IGT test suit:
https://cgit.freedesktop.org/drm/igt-gpu-tools/
follow the README on how to setup piglit which is the framework we use to run and capture results of the IGT tests. You will probably need to update your: piglit/piglit.conf
by adding:
[igt]
path="path to your igt directory"

The extended.testlist is available in Jira, but I will upload it here as well.

You need to stop your window manager to run the IGT test. On Ubuntu enter virtual terminal:
> sudo systemctl stop lightdm

Now you can run the extended.testlist like this:
> sudo ./piglit/piglit run igt --test-list tests/intel-ci/extended.testlist -o -l verbose -s results

But you may want to write a little script to loop over the execution. The full extended list takes about 6 hours to complete, my guess is that by only running suspend-resume related test the issue would be reproduced faster, but I am not 100% sure that is the case.
Comment 7 Marta Löfstedt 2017-03-29 07:04:39 UTC
FYI, if I blacklist the snd_hda_core, snd_hda_codec and snd_hda_intel I of course doesn't hit the issue, we could consider doing this for the CI system in order to stop being blocked by this issue.
Comment 8 Marta Löfstedt 2017-03-30 05:56:10 UTC
Here is yet another reproduction of the issue:
http://intel-gfx-ci.fi.intel.com/archive/results/CI_IGT_test/CI_DRM_324/f2-bdw-5250u/
Comment 9 Daniel Vetter 2017-03-30 07:38:56 UTC
Just fyi, if audio can't/won't fix this I guess we need to consider removing the sound trees from drm-tip again. I'm not against including them in our trees (it makes a lot of sense), but there's a lot of trouble coming from the audio side. Audio needs to keep up their side of this bargain here.
Comment 10 Jani Nikula 2017-03-30 08:01:25 UTC
(In reply to Marta Löfstedt from comment #7)
> FYI, if I blacklist the snd_hda_core, snd_hda_codec and snd_hda_intel I of
> course doesn't hit the issue, we could consider doing this for the CI system
> in order to stop being blocked by this issue.

(In reply to Daniel Vetter from comment #9)
> Just fyi, if audio can't/won't fix this I guess we need to consider removing
> the sound trees from drm-tip again. I'm not against including them in our
> trees (it makes a lot of sense), but there's a lot of trouble coming from
> the audio side. Audio needs to keep up their side of this bargain here.

IMO blacklisting audio in our CI or excluding the audio tree from drm-tip are self deception, and postponing the inevitable breakage until audio and our trees are merged to Linus' tree. We *never* want to be in a situation where we see the issues at -rc1 and have limited time and options to resolve issues, under pressure. Seeing the issues *now* in CI is a fundamentally good thing, and if there are problems with that, we need to seek other solutions than dodging them.

One thing to do is to involve the audio folks more in graphics CI and graphics bugs. There is value for *them* in what we do here, and they should be interested and contribute back. More collaboration benefits both of us. And there are plenty of Intel folks involved with audio, should not be rocket science to figure this out.

Another, but much bigger, thing is to start gating drm-tip branch updates. Instead of pulling in everything at every drm-tip build, figure out a way to ask for CI runs with good baseline + updated audio tree with specified sha1. If CI says good, update, otherwise escalate with audio.
Comment 11 Libin Yang 2017-03-30 08:44:16 UTC
I will debug on this issue. This is really strange that it shows:
<4>[ 1211.603059]  ? delay_tsc+0x3d/0xc0
<4>[ 1211.603059]  __delay+0xa/0x10
<4>[ 1211.603060]  __const_udelay+0x31/0x40
<4>[ 1211.603060]  snd_hdac_bus_stop_cmd_io+0x96/0xe0 [snd_hda_core]

From my knowledge snd_hdac_bus_stop_cmd_io() will not cause such issue. So my opinion is that maybe audio use udelay while delay_tsc() cause this issue (which is the last dump of the oops). Maybe the tsc is off when audio is still waiting for its response? What's your suggestion? If so, I think we can ask kernel team to involve?
Comment 12 Marta Löfstedt 2017-04-10 06:35:44 UTC
I reverted:
 commit 38b19ed7f81ec930f3ad2066ae088f574970c814
Author: Jeeja KP <jeeja.kp@intel.com>
Date:   Thu May 5 11:24:43 2016 +0530

    ALSA: hda: fix to wait for RIRB & CORB DMA to set
    
    If the DMAs are not being quiesced properly, it may lead to
    stability issues, so the recommendation is to wait till DMAs are
    stopped.
    
    After setting the stop bit of RIRB/CORB DMA, we should wait for
    stop bit to be set.

and then I can't reproduce the issue.
Comment 13 Marta Löfstedt 2017-04-12 08:01:07 UTC
For more information on the patch I reverted. The patch introduce a potential 200ms busy-wait when interrupts are turned off. Also, if the state isn't updated with-in those 200 ms, nothing is done. 
I believe this patch should be rewritten by audio developers.
Comment 14 Imre Deak 2017-04-12 08:48:22 UTC
(In reply to Marta Löfstedt from comment #13)
> For more information on the patch I reverted. The patch introduce a
> potential 200ms busy-wait when interrupts are turned off. Also, if the state
> isn't updated with-in those 200 ms, nothing is done. 
> I believe this patch should be rewritten by audio developers.

Agreed, using jiffies in hdac_wait_for_cmd_dmas() to determine when to time out when interrupts are off (snd_hdac_bus_stop_cmd_io()/spin_lock_irq()) is bogus.
Comment 15 Libin Yang 2017-04-12 14:38:25 UTC
Agree with Imre. I will ask Jeeja to fix this bug.
Comment 16 Marta Löfstedt 2017-04-27 06:22:51 UTC
I have been exploring the possibility that Imres patch:

https://patchwork.freedesktop.org/patch/149696/

could affect this issue. I.e. if I couldn't reproduce this issue with Imres patch applied, we could close this issue. However, yesterdays nightly run on my BDW i5NUC, with: drm-tip: 2017y-04m-25d-14h-42m-59s UTC integration manifest, panicked out and rebooted with the following backtrace in pstore logs:


<6>[ 4498.879228] PM: Syncing filesystems ... done.
<7>[ 4498.881857] PM: Preparing system for sleep (mem)
<6>[ 4498.882336] Freezing user space processes ... (elapsed 0.001 seconds) done.
<6>[ 4498.883833] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
<7>[ 4498.884610] PM: Suspending system (mem)
<4>[ 4498.884642] Suspending console(s) (use no_console_suspend to debug)
<5>[ 4498.907035] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<6>[ 4498.907619] nuvoton-cir 00:01: disabled
<6>[ 4498.907677] e1000e: EEE TX LPI TIMER: 00000011
<5>[ 4498.909626] sd 0:0:0:0: [sda] Stopping disk
<4>[ 4516.550916] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff8dc99f0c
<4>[ 4516.550925] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff8dc99f0c
<4>[ 4516.550926] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff8dc99f0c
<0>[ 4516.551893] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
<4>[ 4516.551894] Modules linked in: rfcomm bnep arc4 binfmt_misc nls_iso8859_1 iwlmvm intel_rapl x86_pkg_temp_thermal intel_p\
owerclamp mac80211 coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_codec_realtek snd_soc\
_rt5640 snd_hda_codec_generic aesni_intel snd_soc_ssm4567 snd_soc_rl6231 snd_soc_core snd_hda_intel btusb aes_x86_64 crypto_si\
md cryptd iwlwifi glue_helper snd_hda_codec btrtl intel_cstate snd_hda_core snd_compress btbcm cfg80211 intel_rapl_perf btinte\
l bluetooth mei_me input_leds snd_pcm snd_seq_midi snd_seq_midi_event mei snd_rawmidi snd_hwdep shpchp lpc_ich snd_seq snd_tim\
er rc_rc6_mce ir_lirc_codec lirc_dev snd_seq_device nuvoton_cir snd rc_core elan_i2c acpi_als soundcore kfifo_buf snd_soc_sst_\
acpi i2c_designware_platform snd_soc_sst_match dw_dmac spi_pxa2xx_platform
<4>[ 4516.551903]  industrialio acpi_pad i2c_designware_core 8250_dw mac_hid parport_pc ppdev lp parport ip_tables x_tables au\
tofs4 hid_generic i915 usbhid i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm e1000e ptp ahci pp\
s_core libahci sdhci_acpi video sdhci i2c_hid hid
<4>[ 4516.551908] CPU: 3 PID: 9028 Comm: kworker/u8:12 Tainted: G     U          4.11.0-rc8+ #36
<4>[ 4516.551908] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0249.2015.0529.1640 05/29/2015
<4>[ 4516.551908] Workqueue: events_unbound async_run_entry_fn
<4>[ 4516.551909] task: ffffa0d38c2f0e40 task.stack: ffffc013427d4000
<4>[ 4516.551909] RIP: 0010:delay_tsc+0x39/0x60
<4>[ 4516.551909] RSP: 0018:ffffc013427d7ca8 EFLAGS: 00000006
<4>[ 4516.551910] RAX: 0000000ba3bd3924 RBX: ffffa0d38f81d800 RCX: 0000000000002a80
<4>[ 4516.551910] RDX: 0000000b00000000 RSI: 0000000ba3bd0ea4 RDI: 0000000000003de0
<4>[ 4516.551910] RBP: ffffc013427d7ca8 R08: 0000000000000003 R09: 0000000000000000
<4>[ 4516.551911] R10: ffffc01340ce7d88 R11: 0000000000000000 R12: ffffa0d38f81dc70
<4>[ 4516.551911] R13: 00000001001004be R14: ffffc013415300c0 R15: ffffffffc0630190
<4>[ 4516.551911] FS:  0000000000000000(0000) GS:ffffa0d39ed80000(0000) knlGS:0000000000000000
<4>[ 4516.551912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 4516.551912] CR2: 00007f14733afab0 CR3: 0000000027409000 CR4: 00000000003406e0
<4>[ 4516.551912] Call Trace:
<4>[ 4516.551912]  __const_udelay+0x38/0x40
<4>[ 4516.551913]  snd_hdac_bus_stop_cmd_io+0x9b/0xf0 [snd_hda_core]
<4>[ 4516.551913]  ? azx_dev_disconnect+0x20/0x20 [snd_hda_intel]
<4>[ 4516.551913]  snd_hdac_bus_stop_chip+0xb6/0x100 [snd_hda_core]
<4>[ 4516.551913]  azx_stop_chip+0xe/0x10 [snd_hda_codec]
<4>[ 4516.551914]  azx_suspend+0x77/0x120 [snd_hda_intel]
<4>[ 4516.551914]  pci_pm_suspend+0x78/0x150
<4>[ 4516.551914]  dpm_run_callback+0x56/0x140
<4>[ 4516.551914]  ? pci_pm_freeze+0xf0/0xf0
<4>[ 4516.551914]  __device_suspend+0x11d/0x390
<4>[ 4516.551915]  async_suspend+0x1f/0xa0
<4>[ 4516.551915]  async_run_entry_fn+0x37/0x150
<4>[ 4516.551915]  process_one_work+0x1fc/0x490
<4>[ 4516.551915]  worker_thread+0x4b/0x500
<4>[ 4516.551916]  kthread+0x109/0x140
<4>[ 4516.551916]  ? process_one_work+0x490/0x490
<4>[ 4516.551916]  ? kthread_create_on_node+0x70/0x70
<4>[ 4516.551916]  ret_from_fork+0x2c/0x40
<4>[ 4516.551916] Code: 09 7d 72 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 0f 31 48 c1 e2 20 48 09 d0 48 89 c1 48 \
29 f1 48 39 cf 76 19 f3 90 <65> 8b 0d a8 09 7d 72 41 39 c8 74 db 48 29 c6 41 89 c8 48 01 f7
<6>[ 4516.551924] Sending NMI from CPU 3 to CPUs 0-2:
<0>[ 4516.551988] Kernel panic - not syncing: Hard LOCKUP
<4>[ 4516.551988] CPU: 3 PID: 9028 Comm: kworker/u8:12 Tainted: G     U          4.11.0-rc8+ #36
<4>[ 4516.551989] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0249.2015.0529.1640 05/29/2015
<4>[ 4516.551989] Workqueue: events_unbound async_run_entry_fn
<4>[ 4516.551989] Call Trace:
<4>[ 4516.551989]  <NMI>
<4>[ 4516.551990]  dump_stack+0x63/0x81
<4>[ 4516.551990]  panic+0xe4/0x22d
<4>[ 4516.551990]  ? find_next_bit+0xb/0x10
<4>[ 4516.551990]  nmi_panic+0x39/0x40
<4>[ 4516.551991]  watchdog_overflow_callback+0xbc/0xd0
<4>[ 4516.551991]  __perf_event_overflow+0x54/0xf0
<4>[ 4516.551991]  perf_event_overflow+0x14/0x20
<4>[ 4516.551991]  intel_pmu_handle_irq+0x229/0x4f0
<4>[ 4516.551992]  ? pci_azx_readw+0x10/0x10 [snd_hda_intel]
<4>[ 4516.551992]  perf_event_nmi_handler+0x2c/0x50
<4>[ 4516.551992]  ? sched_clock+0x9/0x10
<4>[ 4516.551992]  ? sched_clock+0x9/0x10
<4>[ 4516.551992]  ? perf_event_nmi_handler+0x2c/0x50
<4>[ 4516.551993]  nmi_handle+0x6a/0x120
<4>[ 4516.551993]  ? pci_azx_readw+0x10/0x10 [snd_hda_intel]
<4>[ 4516.551993]  default_do_nmi+0x53/0x110
<4>[ 4516.551993]  do_nmi+0xec/0x140
<4>[ 4516.551993]  end_repeat_nmi+0x1a/0x1e
<4>[ 4516.551994] RIP: 0010:delay_tsc+0x39/0x60
<4>[ 4516.551994] RSP: 0018:ffffc013427d7ca8 EFLAGS: 00000006
<4>[ 4516.551994] RAX: 0000000ba3bd3924 RBX: ffffa0d38f81d800 RCX: 0000000000002a80
<4>[ 4516.551995] RDX: 0000000b00000000 RSI: 0000000ba3bd0ea4 RDI: 0000000000003de0
<4>[ 4516.551995] RBP: ffffc013427d7ca8 R08: 0000000000000003 R09: 0000000000000000
<4>[ 4516.551995] R10: ffffc01340ce7d88 R11: 0000000000000000 R12: ffffa0d38f81dc70
<4>[ 4516.551996] R13: 00000001001004be R14: ffffc013415300c0 R15: ffffffffc0630190
<4>[ 4516.551996]  ? pci_azx_readw+0x10/0x10 [snd_hda_intel]
<4>[ 4516.551996]  ? delay_tsc+0x39/0x60
<4>[ 4516.551996]  ? delay_tsc+0x39/0x60
<4>[ 4516.551996]  </NMI>
<4>[ 4516.551997]  __const_udelay+0x38/0x40
<4>[ 4516.551997]  snd_hdac_bus_stop_cmd_io+0x9b/0xf0 [snd_hda_core]
<4>[ 4516.551997]  ? azx_dev_disconnect+0x20/0x20 [snd_hda_intel]
<4>[ 4516.551997]  snd_hdac_bus_stop_chip+0xb6/0x100 [snd_hda_core]
<4>[ 4516.551998]  azx_stop_chip+0xe/0x10 [snd_hda_codec]
<4>[ 4516.551998]  azx_suspend+0x77/0x120 [snd_hda_intel]
<4>[ 4516.551998]  pci_pm_suspend+0x78/0x150
<4>[ 4516.551998]  dpm_run_callback+0x56/0x140
<4>[ 4516.551998]  ? pci_pm_freeze+0xf0/0xf0
<4>[ 4516.551999]  __device_suspend+0x11d/0x390
<4>[ 4516.551999]  async_suspend+0x1f/0xa0
<4>[ 4516.551999]  async_run_entry_fn+0x37/0x150
<4>[ 4516.551999]  process_one_work+0x1fc/0x490
<4>[ 4516.551999]  worker_thread+0x4b/0x500
<4>[ 4516.552000]  kthread+0x109/0x140
<4>[ 4516.552000]  ? process_one_work+0x490/0x490
<4>[ 4516.552000]  ? kthread_create_on_node+0x70/0x70
<4>[ 4516.552000]  ret_from_fork+0x2c/0x40
<0>[ 4516.552025] Kernel Offset: 0xc400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Comment 17 Marta Löfstedt 2017-05-04 06:54:42 UTC
This issue is still negatively affecting the coverage of the IGT extended list for BDW.
  
Here are data for some new reproductions:
http://intel-gfx-ci.fi.intel.com/archive/results/CI_IGT_test/CI_DRM_382/f2-bdw-5250u/

http://intel-gfx-ci.fi.intel.com/archive/results/CI_IGT_test/CI_DRM_382/f2-bdw-5557u

http://intel-gfx-ci.fi.intel.com/archive/results/CI_IGT_test/CI_DRM_383/f2-bdw-5250u/

Also, note that this is after Imres patch-set that I discussed earlier was moved. So, this proves that Imres patch-set does not affect this bug.
Comment 18 jeeja 2017-05-04 09:51:46 UTC
Created attachment 131196 [details] [review]
patch
Comment 19 jeeja 2017-05-04 09:53:07 UTC
Marta, can you check with the attached patch.
Comment 20 Jani Nikula 2017-05-04 10:17:47 UTC
(In reply to jeeja from comment #18)
> Created attachment 131196 [details] [review]
> patch

No matter what the fix ends up being, please add these:

Fixes: 38b19ed7f81e ("ALSA: hda: fix to wait for RIRB & CORB DMA to set")
Cc: Jeeja KP <jeeja.kp@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: <stable@vger.kernel.org> # v4.7+
Comment 21 Marta Löfstedt 2017-05-04 10:18:20 UTC
The patch is identical to  sent in email on 5 of April.
I will test it again...
Comment 22 Jani Saarinen 2017-05-04 10:30:45 UTC
reference to patch from Chris:
https://lists.freedesktop.org/archives/intel-gfx/2017-May/127230.html
Comment 23 Marta Löfstedt 2017-05-05 06:14:34 UTC
I have tested Jeejas patch OK.
If audio devs want Chris patch I suggest they validate that one.
Comment 24 Marta Löfstedt 2017-05-05 06:29:42 UTC
Quote from email:
Takashi prefers dropping the locks
http://mid.mail-archive.com/s5h1ss4q5gw.wl-tiwai@suse.de
Comment 25 Ricardo 2017-05-09 16:53:22 UTC
Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 26 Marta Löfstedt 2017-05-10 10:23:00 UTC
commit 960013762df0a214b57f2fce655422fb52bdfd2c
Author: Jeeja KP <jeeja.kp@intel.com>
Date:   Wed May 10 11:51:58 2017 +0530

    ALSA: hda: Fix cpu lockup when stopping the cmd dmas

Is now merged to drm-tip, but it is still not upstreamed. 

I have not been able to reproduce the audio issue on my BDW.
So. let's hope this good result also comes though on CI extended testing.
Comment 27 jeeja 2017-05-10 10:31:37 UTC
patch is posted and applied. 
https://patchwork.kernel.org/patch/9719241/
Comment 28 Marta Löfstedt 2017-05-17 06:18:40 UTC
This can now be closed since the fix is in drm-tip.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.