Bug 108800 - [CI][BAT] igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment()
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Vanshidhar Konda
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-19 16:45 UTC by Martin Peres
Modified: 2019-05-07 07:19 UTC (History)
CC List: 4 users

See Also:
i915 platform: BSW/CHT, BXT, BYT, CNL, SKL
i915 features: power/runtime PM


Attachments
Basic-rte logs (9.91 MB, text/plain)
2019-03-21 10:17 UTC, Ida

Description Martin Peres 2018-11-19 16:45:49 UTC
The following happened twice in 245 runs on fi-cnl-u:

Starting subtest: module-reload
(pm_rpm:3770) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:2088:
(pm_rpm:3770) CRITICAL: Failed assertion: setup_environment()
Subtest module-reload failed.
Comment 1 Martin Peres 2018-12-03 14:46:29 UTC
Also seen on SKL: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5186_157/fi-skl-6600u/igt@pm_rpm@basic-rte.html

Starting subtest: basic-rte
(pm_rpm:3173) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:1948:
(pm_rpm:3173) CRITICAL: Failed assertion: setup_environment()
Subtest basic-rte failed.
Comment 2 Martin Peres 2018-12-17 17:07:59 UTC
Also seen on BSW and BYT: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5322/fi-byt-j1900/igt@pm_rpm@basic-rte.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5321/fi-bsw-cyan/igt@pm_rpm@basic-rte.html

Starting subtest: basic-rte
(pm_rpm:2797) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:1967:
(pm_rpm:2797) CRITICAL: Failed assertion: setup_environment()
Subtest basic-rte failed.
Comment 3 Lakshmi 2019-02-28 07:08:20 UTC
The device fails to suspend when idle, which leads to higher battery consumption than expected. Setting the priority to High based on the impact of this bug.
Comment 4 Vanshidhar Konda 2019-03-04 17:43:35 UTC
This test seems to fail mostly on the BSW and BYT systems. Here are the observations I've made so far:

1) The kernel device usage count for suspending the device is higher (+1) in the failed execution of the test than in the successful execution.

2) When the test disables all screens (through a modeset), the power-wells in the display domain are turned off, except for the display and always-on power-wells.

3) In the failed case, there is a difference in the wakeref acquire/release log. The following wakerefs do not appear in the successful executions.
Wakeref last acquired:
   intel_display_power_get+0x18/0x50 [i915]
   intel_power_domains_init_hw+0x90/0x500 [i915]
   intel_power_domains_resume+0x3d/0x70 [i915]
   i915_pm_resume_early+0x9d/0x130 [i915]
   dpm_run_callback+0x64/0x280
   device_resume_early+0xa6/0xe0
   async_resume_early+0x14/0x40
   async_run_entry_fn+0x34/0x160
Wakeref last released:
   i915_drm_suspend_late+0xad/0x120 [i915]
   dpm_run_callback+0x64/0x280
   __device_suspend_late+0xad/0x140
   async_suspend_late+0x15/0x90
   async_run_entry_fn+0x34/0x160
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130

Possible reasons for failure:
The display and always-on power-wells have a higher reference count than the rest of the power-wells in the display domain.

Next steps:
1) Confirm that the reference count on the display and always-on power-wells differs between the failing and passing cases.
2) If confirmed, try to figure out the reason for the extra reference count on the power-wells in question.
Comment 5 Imre Deak 2019-03-07 11:16:26 UTC
(In reply to Vanshidhar Konda from comment #4)

This could be due to the audio driver not suspending.

Could you provide the contents of /sys/kernel/debug/dri/0/i915_power_domain_info after all screens are disabled (with runtime PM enabled for both the i915 and the audio driver)?
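
One way to act on this request, and on the "next steps" from comment 4, would be to capture the file once in a passing run and once in a failing run and compare the use counts. The sketch below is only an illustration: it assumes each data line has the shape "name, whitespace, integer" with domains indented under their power well, and the sample dumps are made up, not taken from any CI machine.

```python
import re

# Assumed line shape: optional indent, a name, whitespace, an integer count.
# Header lines that do not end in an integer are skipped.
LINE_RE = re.compile(r"^(\s*)(\S.*?)\s+(\d+)\s*$")


def parse_use_counts(text):
    """Return {key: count}; indented lines are keyed as "<well>/<domain>"."""
    counts = {}
    well = None
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        indent, name, count = m.group(1), m.group(2), int(m.group(3))
        if indent and well is not None:
            counts[f"{well}/{name}"] = count
        else:
            well = name
            counts[name] = count
    return counts


def diff_use_counts(passing, failing):
    """Return {key: (passing_count, failing_count)} for entries that differ."""
    a, b = parse_use_counts(passing), parse_use_counts(failing)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


# Hypothetical dumps, for illustration only.
PASSING = """\
Power well/domain         Use count
always-on                 0
  PIPE_A                  0
  AUDIO                   0
display                   0
  PIPE_A                  0
  AUDIO                   0
"""
FAILING = PASSING.replace("always-on                 0", "always-on                 1") \
                 .replace("display                   0", "display                   1") \
                 .replace("  AUDIO                   0", "  AUDIO                   1")

for key, (ok, bad) in diff_use_counts(PASSING, FAILING).items():
    print(f"{key}: pass={ok} fail={bad}")
```

If the real dumps differ only in the display and always-on wells and a single domain under them, that would narrow down which user is holding the extra reference.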
Comment 6 Imre Deak 2019-03-07 11:40:42 UTC
Note that you could get the equivalent info by running the test with
https://patchwork.freedesktop.org/patch/290262/?series=57526&rev=1
Comment 7 Vanshidhar Konda 2019-03-07 16:50:44 UTC
(In reply to Imre Deak from comment #5)
> This could be due to the audio driver not suspending.
> 
> Could you provide the contents of
> /sys/kernel/debug/dri/0/i915_power_domain_info after all screen gets
> disabled (with runtime PM enabled for both the i915 and the audio driver)?

On the CI systems it seems like runtime PM is enabled for i915. How can I set up or check whether runtime PM is enabled for the audio driver?
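
A possible way to check this from userspace is through the generic sysfs runtime PM attributes: a device's power/control file reads "auto" when runtime PM is allowed and "on" when it is forced active, and power/runtime_status reports the current state. The sketch below is a rough illustration under some assumptions: it only looks at PCI devices whose class code starts with 0x0403 (HD Audio controllers), it does not cover the codec devices, and the helper names are made up for this example. The demo at the end runs against a fake directory tree, not a real machine.

```python
import tempfile
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def runtime_pm_state(device_dir):
    """Summarize the runtime PM attributes of one sysfs device directory."""
    power = Path(device_dir) / "power"
    control = read_attr(power / "control")
    return {
        "control": control,
        "runtime_status": read_attr(power / "runtime_status"),
        # "auto" means userspace allows the device to runtime suspend.
        "runtime_pm_enabled": control == "auto",
    }


def hda_devices(pci_root="/sys/bus/pci/devices"):
    """List PCI device directories whose class code starts with 0x0403."""
    root = Path(pci_root)
    if not root.is_dir():
        return []
    return sorted(d for d in root.iterdir()
                  if (read_attr(d / "class") or "").startswith("0x0403"))


# Demo against a fake tree (hypothetical device address and values).
with tempfile.TemporaryDirectory() as tmp:
    dev = Path(tmp) / "0000:00:1b.0"
    (dev / "power").mkdir(parents=True)
    (dev / "class").write_text("0x040300\n")
    (dev / "power" / "control").write_text("on\n")
    (dev / "power" / "runtime_status").write_text("active\n")
    demo = {d.name: runtime_pm_state(d) for d in hda_devices(tmp)}

print(demo)
```

On a real system, calling hda_devices() with its default path and passing each result to runtime_pm_state() would show whether the audio controller is allowed to runtime suspend.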
Comment 8 Vanshidhar Konda 2019-03-08 02:14:28 UTC
CI reported a few failures on BSW/BYT systems after taking the patch to IGT from Chris.

This shows that there is one reference remaining even after disabling all the screens. Also, as you pointed out earlier, there is a reference from the audio driver that is not present in successful runs.

Wakeref x1 taken at:
   intel_display_power_get+0x18/0x50 [i915]
   i915_audio_component_get_power+0x11/0x20 [i915]
   snd_hdac_display_power+0x6a/0x100 [snd_hda_core]
   hda_codec_runtime_resume+0x52/0x60 [snd_hda_codec]
   pm_runtime_force_resume+0x6a/0xd0
   dpm_run_callback+0x64/0x280
   device_resume+0xb3/0x1e0
   async_resume+0x14/0x40

But why would this reference be taken only in the failing cases, which occur in only some of the runs?
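
To spot this kind of leftover reference across many CI logs, the wakeref trace could be parsed mechanically. The following is a minimal sketch, assuming frames have the "function+0xoffset/0xsize [module]" shape seen in the trace above; the helper names are invented for this example, and treating any module whose name starts with "snd_hda" as the audio driver is an assumption.

```python
import re

# One stack frame: function+0xOFFSET/0xSIZE, optionally followed by [module].
FRAME_RE = re.compile(
    r"^\s*(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace):
    """Return a list of (function, module-or-None) tuples, in trace order."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group("func"), m.group("module")))
    return frames


def modules_in_trace(trace):
    """Return the distinct module names in the trace, in first-seen order."""
    seen = []
    for _, module in parse_frames(trace):
        if module and module not in seen:
            seen.append(module)
    return seen


def held_by_audio(trace):
    """True if any frame belongs to a module named snd_hda* (assumed audio)."""
    return any(m and m.startswith("snd_hda") for _, m in parse_frames(trace))


# The trace reported in this comment.
AUDIO_TRACE = """\
Wakeref x1 taken at:
   intel_display_power_get+0x18/0x50 [i915]
   i915_audio_component_get_power+0x11/0x20 [i915]
   snd_hdac_display_power+0x6a/0x100 [snd_hda_core]
   hda_codec_runtime_resume+0x52/0x60 [snd_hda_codec]
   pm_runtime_force_resume+0x6a/0xd0
   dpm_run_callback+0x64/0x280
   device_resume+0xb3/0x1e0
   async_resume+0x14/0x40
"""

print(modules_in_trace(AUDIO_TRACE))
print(held_by_audio(AUDIO_TRACE))
```

Running the same check over the trace from comment 4 would report only i915 frames, which matches the observation that the audio reference shows up only in the later logs.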

Links to a few failures after the patch to IGT from Chris:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-byt-squawks/igt@i915_pm_rpm@basic-rte.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-byt-j1900/igt@i915_pm_rpm@basic-rte.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-bsw-cyan/igt@i915_pm_rpm@basic-rte.html
Comment 9 Ida 2019-03-21 10:17:44 UTC
Created attachment 143745
Basic-rte logs

Hello,
We've executed subtest basic-rte 1000 times on ICL-U. The issue does not reproduce.
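
For a sense of how much 1000 clean runs say, the arithmetic can be sketched as follows. This assumes runs are independent and that the platform under test would fail at the same rate as the 2 failures in 245 runs reported in the description for fi-cnl-u; neither assumption is confirmed for ICL-U.

```python
def prob_no_failure(failures, runs, trials):
    """Probability of seeing zero failures in `trials` independent runs,
    given an observed failure rate of failures/runs."""
    rate = failures / runs
    return (1.0 - rate) ** trials


# Rate from the description (2 in 245), applied to 1000 runs.
p_clean = prob_no_failure(2, 245, 1000)
print(f"{p_clean:.2e}")
```

Under those assumptions a clean 1000-run series would be very unlikely on an affected platform, so the result mainly suggests that ICL-U does not hit the issue at that rate.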
Comment 10 Vanshidhar Konda 2019-03-21 17:07:06 UTC
(In reply to Ida from comment #9)

Hello, the issue is only seen on Braswell and Baytrail systems - mostly Chromebooks.
Comment 11 Chris Wilson 2019-03-30 20:55:54 UTC
Looks like either -rc1 or Petri's static analysis cleanup of lib/igt_pm.c made this disappear from BAT. At least, I can't see any residual wakeref caused by snd_hda over the last couple of weeks.
Comment 12 CI Bug Log 2019-04-27 18:45:43 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT BSW SKL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment() -}
{+ BYT BSW APL SKL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment() +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_265/fi-apl-guc/igt@i915_pm_rpm@basic-rte.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_267/fi-apl-guc/igt@i915_pm_rpm@basic-rte.html
Comment 13 CI Bug Log 2019-04-27 19:34:22 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT BSW APL SKL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment() -}
{+ BYT BSW APL SKL CFL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment() +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_265/fi-cfl-guc/igt@i915_pm_rpm@basic-rte.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_266/fi-cfl-guc/igt@i915_pm_rpm@basic-rte.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_267/fi-cfl-guc/igt@i915_pm_rpm@basic-rte.html
Comment 14 CI Bug Log 2019-05-07 07:19:42 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT BSW APL SKL CFL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment() -}
{+ BYT BSW APL SKL CFL CNL: igt@pm_rpm@(module-reload|basic-rte) - fail/dmesg-fail - Failed assertion: setup_environment() +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_274/fi-apl-guc/igt@i915_pm_rpm@basic-rte.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_274/fi-cfl-guc/igt@i915_pm_rpm@basic-rte.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_274/fi-skl-guc/igt@i915_pm_rpm@basic-rte.html

