Bug 109513 - [CI][BAT] igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero
Summary: [CI][BAT] igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO i...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: Other All
: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-31 08:33 UTC by Martin Peres
Modified: 2019-04-17 14:42 UTC (History)
2 users (show)

See Also:
i915 platform: GLK, ICL, SKL
i915 features: display/audio


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Peres 2019-01-31 08:33:06 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5511/fi-skl-6700hq/igt@pm_rpm@module-reload.html

<4> [392.459226] ------------[ cut here ]------------
<4> [392.459263] Use count on domain AUDIO is already zero
<7> [392.459275] i915 Wakeref last acquired:
   intel_display_power_get+0x18/0x50 [i915]
   snd_hdac_display_power+0x5e/0xf0 [snd_hda_core]
   azx_probe_work+0x81/0x7e0 [snd_hda_intel]
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130
   ret_from_fork+0x3a/0x50
<4> [392.459325] WARNING: CPU: 5 PID: 68 at drivers/gpu/drm/i915/intel_runtime_pm.c:1928 __intel_display_power_put+0xe9/0x100 [i915]
<4> [392.459328] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi snd_hda_codec_generic x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec r8169 snd_hwdep realtek snd_hda_core snd_pcm mei_me mei i2c_i801 prime_numbers [last unloaded: i915]
<4> [392.459349] CPU: 5 PID: 68 Comm: kworker/5:1 Tainted: G     U            5.0.0-rc4-CI-CI_DRM_5511+ #1
<4> [392.459353] Hardware name: TOSHIBA SATELLITE P50-C/06F4                            , BIOS 1.50 07/05/2017
<4> [392.459361] Workqueue: pm pm_runtime_work
<4> [392.459405] RIP: 0010:__intel_display_power_put+0xe9/0x100 [i915]
<7> [392.459408] i915 Wakeref last released:
   intel_runtime_pm_put_unchecked+0xd/0x30 [i915]
   __i915_gem_free_work+0x5a/0x90 [i915]
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130
   ret_from_fork+0x3a/0x50
<4> [392.459412] Code: e1 48 8b 30 4c 89 e7 e8 05 4f ca e0 0f 0b 8b 53 08 eb bc 44 89 e7 e8 76 fd ff ff 48 c7 c7 38 df 57 a0 48 89 c6 e8 e7 4e ca e0 <0f> 0b 8b 83 60 a2 00 00 e9 3f ff ff ff 66 2e 0f 1f 84 00 00 00 00
<7> [392.459416] i915 Wakeref count: 5
<4> [392.459418] RSP: 0018:ffffc900002d7cc8 EFLAGS: 00010286
<4> [392.459424] RAX: 0000000000000000 RBX: ffff88821f760074 RCX: 0000000000000000
<4> [392.459428] RDX: 0000000000000007 RSI: ffffffff820cbc09 RDI: 00000000ffffffff
<4> [392.459431] RBP: ffff88821f76a1e0 R08: 00000000cb90ba03 R09: 0000000000000000
<4> [392.459434] R10: ffffc900002d7cc8 R11: 0000000000000000 R12: 000000000000001d
<4> [392.459437] R13: ffff88821f760000 R14: ffffffff815309a0 R15: 0000000000000008
<4> [392.459441] FS:  0000000000000000(0000) GS:ffff888279b40000(0000) knlGS:0000000000000000
<4> [392.459444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [392.459446] CR2: 00007ffe5af55cb7 CR3: 0000000230354003 CR4: 00000000003606e0
<4> [392.459449] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [392.459451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [392.459454] Call Trace:
<4> [392.459463]  ? pci_pm_runtime_resume+0xc0/0xc0
<4> [392.459502]  intel_display_power_put_unchecked+0x9/0x10 [i915]
<7> [392.459504] i915 Wakeref x1 taken at:
   intel_display_power_get+0x18/0x50 [i915]
   snd_hdac_display_power+0x5e/0xf0 [snd_hda_core]
   azx_probe_work+0x81/0x7e0 [snd_hda_intel]
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130
   ret_from_fork+0x3a/0x50
<4> [392.459513]  snd_hdac_display_power+0xd3/0xf0 [snd_hda_core]
<4> [392.459520]  ? pci_azx_readl+0x10/0x10 [snd_hda_intel]
<4> [392.459525]  azx_runtime_suspend+0x6c/0x150 [snd_hda_intel]
<4> [392.459532]  pci_pm_runtime_suspend+0x5a/0x190
<4> [392.459539]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [392.459547]  __rpm_callback+0xb3/0x1b0
<4> [392.459554]  ? pci_pm_runtime_resume+0xc0/0xc0
<4> [392.459562]  rpm_callback+0x1a/0x70
<4> [392.459568]  ? pci_pm_runtime_resume+0xc0/0xc0
<4> [392.459574]  rpm_suspend+0x128/0x6a0
<4> [392.459585]  pm_runtime_work+0x9b/0xb0
<4> [392.459590]  process_one_work+0x245/0x610
<4> [392.459601]  worker_thread+0x37/0x380
<4> [392.459607]  ? process_one_work+0x610/0x610
<7> [392.459609] i915 Wakeref x1 taken at:
   intel_display_power_get+0x18/0x50 [i915]
   i915_audio_component_codec_wake_override+0x2c/0xe0 [i915]
   snd_hdac_set_codec_wakeup+0x35/0x80 [snd_hda_core]
   snd_hdac_display_power+0x6b/0xf0 [snd_hda_core]
   azx_probe_work+0x81/0x7e0 [snd_hda_intel]
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130
<4> [392.459612]  kthread+0x119/0x130
<4> [392.459619]  ? kthread_park+0x80/0x80
<4> [392.459628]  ret_from_fork+0x3a/0x50
<4> [392.459642] irq event stamp: 1024316
<4> [392.459647] hardirqs last  enabled at (1024315): [<ffffffff8112250e>] console_unlock+0x3fe/0x600
<4> [392.459652] hardirqs last disabled at (1024316): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [392.459658] softirqs last  enabled at (1024126): [<ffffffff8181dbcc>] peernet2id+0x4c/0x70
<4> [392.459663] softirqs last disabled at (1024124): [<ffffffff8181dbad>] peernet2id+0x2d/0x70
<4> [392.459704] WARNING: CPU: 5 PID: 68 at drivers/gpu/drm/i915/intel_runtime_pm.c:1928 __intel_display_power_put+0xe9/0x100 [i915]
<4> [392.459707] ---[ end trace f97aafc90e0032f1 ]---
Comment 1 CI Bug Log 2019-01-31 08:33:51 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2277/fi-skl-6700hq/igt@pm_rpm@module-reload.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3733/fi-skl-6260u/igt@pm_rpm@module-reload.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5511/fi-skl-6700hq/igt@pm_rpm@module-reload.html
Comment 2 Chris Wilson 2019-01-31 08:43:10 UTC
Overlapping WARNs, I feel we are missing some serialisation -- hopefully just at the debug level.
Comment 3 CI Bug Log 2019-02-06 15:20:38 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero -}
{+ SKL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5551/fi-icl-y/igt@pm_rpm@module-reload.html
Comment 4 CI Bug Log 2019-02-06 15:32:17 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero -}
{+ SKL WHL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5534/fi-whl-u/igt@pm_rpm@module-reload.html
Comment 6 CI Bug Log 2019-02-21 14:20:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL WHL ICL: igt@runner@aborted - fail - Previous test: pm_rpm (module-reload) -}
{+ SKL WHL ICL: igt@runner@aborted - fail - Previous test: i915_pm_rpm (module-reload) +}

 No new failures caught with the new filter
Comment 7 CI Bug Log 2019-02-28 14:51:16 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL WHL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero -}
{+ SKL KBL WHL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5670/fi-kbl-7560u/igt@i915_pm_rpm@module-reload.html
Comment 8 Ida 2019-03-15 11:14:30 UTC
This issue does not reproduce on ICL U (HDA Legacy Audio).

This is the Kernel we use:
https://gfx-assets.igk.intel.com/artifactory/webapp/#/artifacts/browse/tree/General/gfx-osgc-assets-igk/builds/DRM/CI_DRM/5675
Comment 9 Arek Hiler 2019-04-05 11:14:09 UTC
(In reply to Ida from comment #8)
> This issue does not reproduce on ICL U (HDA Legacy Audio).

Hey,

Thanks for letting us know. CI_DRM_5675 is from 28th of February, but we still have hit the issue after that:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5811/fi-whl-u/igt@i915_pm_rpm@module-reload.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5788/fi-whl-u/igt@i915_pm_rpm@module-reload.html
etc.

Seems like the reproduction rate is once every 150-200 runs.

We see it mostly on WHL nowadays, but it will take about 1k runs to be sensibly sure that its gone for ICL for good.
Comment 10 Lakshmi 2019-04-05 11:21:03 UTC
For customer, the impact is negligible but for CI its annoying. Setting the priority to Medium.
Comment 11 CI Bug Log 2019-04-05 20:23:48 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL KBL WHL ICL: igt@pm_rpm@module-reload - dmesg-warn - Use count on domain AUDIO is already zero -}
{+ SKL KBL GLK WHL ICL: igt@pm_rpm@module-reload|system-suspend-execbuf - dmesg-warn/fail - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5873/shard-glk2/igt@i915_pm_rpm@system-suspend-execbuf.html
Comment 12 CI Bug Log 2019-04-05 20:28:25 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL KBL GLK WHL ICL: igt@pm_rpm@module-reload|system-suspend-execbuf - dmesg-warn/fail - Use count on domain AUDIO is already zero -}
{+ SKL KBL GLK WHL ICL: igt@pm_rpm@module-reload|system-suspend-execbuf|pm-tiling - dmesg-warn/fail - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4929/shard-glk6/igt@i915_pm_rpm@pm-tiling.html
Comment 13 Imre Deak 2019-04-07 10:42:57 UTC
The issue seems to be still there:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12702/shard-glk5/dmesg1.log

I think it's a significant issue that someone from the audio team should track down and fix.
Comment 14 CI Bug Log 2019-04-08 09:41:25 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL KBL GLK WHL ICL: igt@pm_rpm@module-reload|system-suspend-execbuf|pm-tiling - dmesg-warn/fail - Use count on domain AUDIO is already zero -}
{+ SKL KBL GLK WHL ICL: igt@(i915_)?pm_rpm@* - dmesg-warn/fail - Use count on domain AUDIO is already zero +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5878/shard-glk4/igt@i915_pm_rpm@dpms-mode-unset-non-lpsp.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5878/shard-glk7/igt@i915_pm_rpm@cursor-dpms.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5880/shard-glk4/igt@i915_pm_rpm@cursor-dpms.html
Comment 15 Amadeusz Sławiński 2019-04-09 13:30:08 UTC
Hi,

First of all, as I understand CI is not testing audio per se, but can you consider adding
snd_hda_core.dyndbg="+pfl" snd_soc_hdac_hdmi.dyndbg="+pfl"
for debugging HDMI Audio functionality to kernel parameters?
It will give more useful logs from audio side.

Can you tell me what exactly is being tested in this test?


One of the things I see, that in all logs, moments before warning device goes to sleep and then is being woken up.

I've took a look at where calls are performed from audio side and I don't see anything obviously wrong with them. But considering that we just finished resume it looks to me like something tries to call __intel_display_power_put() on device that is still down.

Another thing is the "enabling always-on" & "disabling always-on" sequence, in failing case it seems suspiciously short to me (in comparison with working cases), but I'm not well versed in graphics code/states, so can't tell if it's really correct.

Suspend:

<7>[  244.855833] [drm:intel_runtime_suspend [i915]] Suspending device
<7>[  244.856535] [drm:gen9_set_dc_state [i915]] Setting DC state from 01 to 00
<7>[  244.856695] [drm:intel_power_well_disable [i915]] disabling power well 1
<7>[  244.856863] [drm:hsw_wait_for_power_well_disable.isra.7 [i915]] power well 1 forced on (bios:1 driver:0 kvmr:0 debug:0)
<7>[  244.857358] [drm:bxt_enable_dc9 [i915]] Enabling DC9
<7>[  244.857415] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 08
<7>[  244.857580] [drm:intel_runtime_suspend [i915]] Device suspended

...

<7>[  244.858101] [drm:intel_atomic_check [i915]] New cdclk calculated to be logical 79200 kHz, actual 79200 kHz
<7>[  244.858208] [drm:intel_atomic_check [i915]] New voltage level calculated to be logical 4, actual 4
<7>[  244.858903] [drm:drm_mode_setcrtc] [CRTC:125:pipe C]

Resume:

<7>[  244.881249] [drm:intel_runtime_resume [i915]] Resuming device
<7>[  244.881317] [drm:bxt_disable_dc9 [i915]] Disabling DC9
<7>[  244.881376] [drm:gen9_set_dc_state [i915]] Setting DC state from 08 to 00
<7>[  244.881458] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 00
<7>[  244.881577] [drm:intel_power_well_enable [i915]] enabling power well 1
<7>[  244.881726] [drm:intel_dump_cdclk_state [i915]] Current CDCLK 19200 kHz, VCO 0 kHz, ref 19200 kHz, bypass 19200 kHz, voltage level 1
<7>[  244.881790] [drm:bxt_init_cdclk [i915]] Sanitizing cdclk programmed by pre-os
<7>[  244.882371] [drm:gen9_enable_dc5 [i915]] Enabling DC5
<7>[  244.882425] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 01
<7>[  244.886970] [drm:__bxt_hpd_detection_setup [i915]] Invert bit setting: hp_ctl:10001818 hp_port:38
<7>[  244.892311] [drm:__bxt_hpd_detection_setup [i915]] Invert bit setting: hp_ctl:10001818 hp_port:30
<7>[  244.895516] [drm:intel_runtime_resume [i915]] Device resumed

Sequence between "enabling always-on" & "disabling always-on":

<7>[  244.895677] [drm:intel_power_well_enable [i915]] enabling always-on
<7>[  244.895732] [drm:intel_power_well_enable [i915]] enabling DC off
<7>[  244.896797] [drm:gen9_set_dc_state [i915]] Setting DC state from 01 to 00
<7>[  244.897856] [drm:intel_atomic_commit_tail [i915]] [ENCODER:126:DDI B]
<7>[  244.897920] [drm:intel_atomic_commit_tail [i915]] [ENCODER:135:DDI C]
<7>[  244.897981] [drm:verify_single_dpll_state.isra.108 [i915]] PORT PLL A
<7>[  244.898053] [drm:verify_single_dpll_state.isra.108 [i915]] PORT PLL B
<7>[  244.898283] [drm:verify_single_dpll_state.isra.108 [i915]] PORT PLL C
<7>[  244.898372] [drm:intel_power_well_disable [i915]] disabling DC off
<7>[  244.898434] [drm:gen9_enable_dc5 [i915]] Enabling DC5
<7>[  244.898488] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 01
<7>[  244.902735] [drm:intel_power_well_disable [i915]] disabling always-on

Warning:

<4>[  244.904316] ------------[ cut here ]------------
<4>[  244.904321] Use count on domain AUDIO is already zero
<4>[  244.904403] WARNING: CPU: 0 PID: 274 at drivers/gpu/drm/i915/intel_runtime_pm.c:1928 __intel_display_power_put+0xe9/0x100 [i915]


Summing up, I see few suspicious things, but without additional logs enabled it's bit hard to tell anything concrete.
Comment 16 Arek Hiler 2019-04-10 07:06:49 UTC
(In reply to Amadeusz Sławiński from comment #15)
> Hi,
> 
> First of all, as I understand CI is not testing audio per se, but can you
> consider adding
> snd_hda_core.dyndbg="+pfl" snd_soc_hdac_hdmi.dyndbg="+pfl"
> for debugging HDMI Audio functionality to kernel parameters?
> It will give more useful logs from audio side.

Hey Amadeusz,

First we would have to see how much noise it would cause using some pre-merge testing. That would require us to send a kernel "HAX" patches that force those options on kernel side to intel-gfx@lists.freedesktop.org. Then, if it's not causing too much havoc, we can enable it for some of the offending machines to gather more data over time.

If you can cook up such patch and send it - that would be appreciated.
 
> Can you tell me what exactly is being tested in this test?

Test: https://gitlab.freedesktop.org/drm/igt-gpu-tools/blob/master/tests/i915/i915_pm_rpm.c#L2101
Preparing the environment: https://gitlab.freedesktop.org/drm/igt-gpu-tools/blob/master/tests/i915/i915_pm_rpm.c#L740

It reloads i915 with display disabled, waits for the graphics to enter d3, reloads again with display enabled, waits for d3 again, enables one screen and then unloads the driver altogether.
Comment 17 Jani Saarinen 2019-04-10 13:14:07 UTC
Setting this highest as blocking our CI GLK DSI to proceed. 
http://gfx-ci.fi.intel.com/tree/drm-tip/fi-glk-dsi.html
Comment 18 Chris Wilson 2019-04-10 13:16:56 UTC
commit 4bdc42094d9c4af75f90ee49b9009ea8dfe41444 (HEAD -> drm-intel-next-queued, 
drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Takashi Iwai <tiwai@suse.de>
Date:   Wed Apr 10 13:03:22 2019 +0200

    ALSA: hda: Fix racy display power access
    
    snd_hdac_display_power() doesn't handle the concurrent calls carefully
    enough, and it may lead to the doubly get_power or put_power calls,
    when a runtime PM and an async work get called in racy way.
    
    This patch addresses it by reusing the bus->lock mutex that has been
    used for protecting the link state change in ext bus code, so that it
    can protect against racy display state changes.  The initialization of
    bus->lock was moved from snd_hdac_ext_bus_init() to
    snd_hdac_bus_init() as well accordingly.
    
    Testcase: igt/i915_pm_rpm/module-reload #glk-dsi
    Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Imre Deak <imre.deak@intel.com>
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/s5h8swiunph.wl-tiwai@suse.de
Comment 19 Martin Peres 2019-04-17 14:42:06 UTC
(In reply to Chris Wilson from comment #18)
> commit 4bdc42094d9c4af75f90ee49b9009ea8dfe41444 (HEAD ->
> drm-intel-next-queued, 
> drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
> Author: Takashi Iwai <tiwai@suse.de>
> Date:   Wed Apr 10 13:03:22 2019 +0200
> 
>     ALSA: hda: Fix racy display power access
>     
>     snd_hdac_display_power() doesn't handle the concurrent calls carefully
>     enough, and it may lead to the doubly get_power or put_power calls,
>     when a runtime PM and an async work get called in racy way.
>     
>     This patch addresses it by reusing the bus->lock mutex that has been
>     used for protecting the link state change in ext bus code, so that it
>     can protect against racy display state changes.  The initialization of
>     bus->lock was moved from snd_hdac_ext_bus_init() to
>     snd_hdac_bus_init() as well accordingly.
>     
>     Testcase: igt/i915_pm_rpm/module-reload #glk-dsi
>     Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Imre Deak <imre.deak@intel.com>
>     Signed-off-by: Takashi Iwai <tiwai@suse.de>
>     Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/s5h8swiunph.wl-tiwai@suse.de

It definitely fixed the issue! Thanks!
Comment 20 CI Bug Log 2019-04-17 14:42:12 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.