Bug 108800 - [CI][BAT] igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: setup_environment()
Summary: [CI][BAT] igt@pm_rpm@(module-reload|basic-rte) - fail - Failed assertion: set...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Vanshidhar Konda
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-19 16:45 UTC by Martin Peres
Modified: 2019-03-30 20:55 UTC

See Also:
i915 platform: BSW/CHT, BYT, CNL, SKL
i915 features: power/runtime PM


Attachments
Basic-rte logs (9.91 MB, text/plain)
2019-03-21 10:17 UTC, Ida

Description Martin Peres 2018-11-19 16:45:49 UTC
The following happened twice in 245 runs on fi-cnl-u:

Starting subtest: module-reload
(pm_rpm:3770) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:2088:
(pm_rpm:3770) CRITICAL: Failed assertion: setup_environment()
Subtest module-reload failed.
Comment 1 Martin Peres 2018-12-03 14:46:29 UTC
Also seen on SKL: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5186_157/fi-skl-6600u/igt@pm_rpm@basic-rte.html

Starting subtest: basic-rte
(pm_rpm:3173) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:1948:
(pm_rpm:3173) CRITICAL: Failed assertion: setup_environment()
Subtest basic-rte failed.
Comment 2 Martin Peres 2018-12-17 17:07:59 UTC
Also seen on BSW and BYT: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5322/fi-byt-j1900/igt@pm_rpm@basic-rte.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5321/fi-bsw-cyan/igt@pm_rpm@basic-rte.html

Starting subtest: basic-rte
(pm_rpm:2797) CRITICAL: Test assertion failure function main, file ../tests/pm_rpm.c:1967:
(pm_rpm:2797) CRITICAL: Failed assertion: setup_environment()
Subtest basic-rte failed.
Comment 3 Lakshmi 2019-02-28 07:08:20 UTC
The device fails to suspend when idle, which leads to higher battery consumption than expected. Setting the priority to High based on the impact of this bug.
Comment 4 Vanshidhar Konda 2019-03-04 17:43:35 UTC
This test seems to fail mostly on the BSW and BYT systems. Here are the observations I've made so far:

1) The kernel device usage count for suspending the device is higher (+1) in the failed execution of the test than in the successful execution.

2) When the test disables all screens (through a modeset), the power-wells in the display domain are turned off, except for the display and always-on power-wells.

3) In the failed case, there is a difference in the wakeref acquire/release log. The following wakerefs do not appear in the successful executions.
Wakeref last acquired:
   intel_display_power_get+0x18/0x50 [i915]
   intel_power_domains_init_hw+0x90/0x500 [i915]
   intel_power_domains_resume+0x3d/0x70 [i915]
   i915_pm_resume_early+0x9d/0x130 [i915]
   dpm_run_callback+0x64/0x280
   device_resume_early+0xa6/0xe0
   async_resume_early+0x14/0x40
   async_run_entry_fn+0x34/0x160
Wakeref last released:
   i915_drm_suspend_late+0xad/0x120 [i915]
   dpm_run_callback+0x64/0x280
   __device_suspend_late+0xad/0x140
   async_suspend_late+0x15/0x90
   async_run_entry_fn+0x34/0x160
   process_one_work+0x245/0x610
   worker_thread+0x37/0x380
   kthread+0x119/0x130

Possible reasons for failure:
The display and always-on power-wells have a higher reference count than the rest of the power-wells in the display domain.

Next steps:
1) Confirm that the reference count on display and always-on power-wells is different between test fail/success case.
2) If confirmed, try to figure out the reason for extra reference count on the power-wells in question.
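
The usage count from observation 1 can also be sampled from sysfs while the test runs. The following is a minimal sketch, not the method used in this investigation. It assumes the GPU is the PCI device at 0000:00:02.0 and that the kernel was built with CONFIG_PM_ADVANCED_DEBUG, which is what exposes power/runtime_usage. The helper name read_runtime_pm is hypothetical.

```python
from pathlib import Path

# Assumption: the integrated GPU sits at this PCI address on the test machine.
I915_POWER_DIR = Path("/sys/bus/pci/devices/0000:00:02.0/power")


def read_runtime_pm(power_dir=I915_POWER_DIR):
    """Return the runtime PM attributes found in a device's sysfs power/ dir.

    Attributes that are missing (for example runtime_usage on kernels built
    without CONFIG_PM_ADVANCED_DEBUG) are reported as None.
    """
    result = {}
    for name in ("control", "runtime_status", "runtime_usage"):
        path = Path(power_dir) / name
        try:
            result[name] = path.read_text().strip()
        except OSError:
            result[name] = None
    return result


if __name__ == "__main__":
    for key, value in read_runtime_pm().items():
        print(f"{key}: {value}")
```

Sampling this once after a passing run and once after a failing run would show whether the extra reference is visible at the device level.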
Comment 5 Imre Deak 2019-03-07 11:16:26 UTC
(In reply to Vanshidhar Konda from comment #4)

This could be due to the audio driver not suspending.

Could you provide the contents of /sys/kernel/debug/dri/0/i915_power_domain_info after all screens are disabled (with runtime PM enabled for both the i915 and the audio driver)?
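
For comparing the dump between passing and failing runs, a small parser can list which power wells and domains still hold references. This is a sketch under an assumed file layout: a "Power well/domain" header line, one unindented "name count" line per power well, and indented "domain count" lines below it. The function names are hypothetical, and the layout should be checked against the kernel under test.

```python
def parse_power_domain_info(text):
    """Parse i915_power_domain_info text into {well: (count, {domain: count})}.

    Assumed layout: a header line, then one unindented "name count" line per
    power well, each followed by indented "domain count" lines.
    """
    wells = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("Power well/domain"):
            continue
        name, _, count = line.strip().rpartition(" ")
        name = name.strip()
        if not name or not count.lstrip("-").isdigit():
            continue  # not a "name count" line; skip rather than guess
        if line[0].isspace() and current is not None:
            wells[current][1][name] = int(count)
        else:
            current = name
            wells[name] = (int(count), {})
    return wells


def held_references(wells):
    """Return the wells and domains whose use count is still non-zero."""
    held = []
    for well, (count, domains) in wells.items():
        if count:
            held.append((well, count))
        held.extend((f"{well}/{dom}", n) for dom, n in domains.items() if n)
    return held
```

Reading the debugfs file requires root. Feeding its contents to parse_power_domain_info and then to held_references gives a short list that can be diffed between runs.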
Comment 6 Imre Deak 2019-03-07 11:40:42 UTC
Note that you could get the equivalent info by running the test with
https://patchwork.freedesktop.org/patch/290262/?series=57526&rev=1
Comment 7 Vanshidhar Konda 2019-03-07 16:50:44 UTC
(In reply to Imre Deak from comment #5)
> This could be due to the audio driver not suspending.
> 
> Could you provide the contents of
> /sys/kernel/debug/dri/0/i915_power_domain_info after all screen gets
> disabled (with runtime PM enabled for both the i915 and the audio driver)?

On the CI systems it seems like runtime PM is enabled for i915. How can I set up or check whether runtime PM is enabled for the audio driver?
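
One way to check this from user space is to read the standard sysfs attributes, sketched below. It assumes the audio controller is bound to the snd_hda_intel PCI driver and relies on two generic kernel interfaces: the per-device power/control file ("auto" allows runtime suspend, "on" forbids it) and the snd_hda_intel power_save module parameter (timeout in seconds, 0 disables power saving). The function name and the "likely_enabled" heuristic are hypothetical, not something confirmed in this bug.

```python
from pathlib import Path

# Assumption: the audio controller is bound to the snd_hda_intel PCI driver.
HDA_DRIVER_DIR = Path("/sys/bus/pci/drivers/snd_hda_intel")
# snd_hda_intel power-save timeout in seconds; 0 disables power saving.
HDA_POWER_SAVE = Path("/sys/module/snd_hda_intel/parameters/power_save")


def audio_runtime_pm_state(driver_dir=HDA_DRIVER_DIR, power_save=HDA_POWER_SAVE):
    """Report whether runtime PM looks enabled for each bound HDA controller."""
    try:
        timeout = int(Path(power_save).read_text().strip())
    except (OSError, ValueError):
        timeout = None  # parameter missing or unreadable
    state = {}
    driver_dir = Path(driver_dir)
    if not driver_dir.is_dir():
        return state
    for dev in sorted(driver_dir.iterdir()):
        control = dev / "power" / "control"
        if not control.is_file():
            continue  # skips bind/unbind/module and other non-device entries
        mode = control.read_text().strip()
        # "auto" lets the kernel runtime-suspend the device; "on" forbids it.
        state[dev.name] = {
            "control": mode,
            "power_save": timeout,
            "likely_enabled": mode == "auto" and bool(timeout),
        }
    return state
```

An empty result means no device is bound to snd_hda_intel, in which case the audio path on that machine uses a different driver and this check does not apply.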
Comment 8 Vanshidhar Konda 2019-03-08 02:14:28 UTC
CI reported a few failures on BSW/BYT systems after picking up the IGT patch from Chris.

The logs show that 1 reference remains even after all the screens are disabled. Also, as you pointed out earlier, there is a reference from the audio driver that is not present in successful runs.

Wakeref x1 taken at:
   intel_display_power_get+0x18/0x50 [i915]
   i915_audio_component_get_power+0x11/0x20 [i915]
   snd_hdac_display_power+0x6a/0x100 [snd_hda_core]
   hda_codec_runtime_resume+0x52/0x60 [snd_hda_codec]
   pm_runtime_force_resume+0x6a/0xd0
   dpm_run_callback+0x64/0x280
   device_resume+0xb3/0x1e0
   async_resume+0x14/0x40

But why would this reference be taken only in the failing cases, which occur in only some of the runs?

Links to a few failures after the patch to IGT from Chris:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-byt-squawks/igt@i915_pm_rpm@basic-rte.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-byt-j1900/igt@i915_pm_rpm@basic-rte.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5718/fi-bsw-cyan/igt@i915_pm_rpm@basic-rte.html
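
When scanning many CI logs for this pattern, the wakeref holder can be picked out of a trace automatically. The sketch below assumes frames in the format shown in the trace above ("function+0xoff/0xsize [module]", with no module suffix for core kernel code) and returns the first frame that is not in i915. The function name wakeref_owner is hypothetical.

```python
import re

# Matches frames such as "snd_hdac_display_power+0x6a/0x100 [snd_hda_core]".
FRAME_RE = re.compile(
    r"^\s*(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def wakeref_owner(trace, own_module="i915"):
    """Return (function, module) of the first frame outside own_module.

    Frames without a [module] suffix belong to the core kernel and are
    reported with module "vmlinux". Returns None if every frame is in
    own_module or nothing parses.
    """
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue  # header lines like "Wakeref x1 taken at:"
        module = match.group("module") or "vmlinux"
        if module != own_module:
            return match.group("func"), module
    return None
```

Applied to the trace above, this returns ("snd_hdac_display_power", "snd_hda_core"), which matches the audio driver reference seen only in the failing runs.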
Comment 9 Ida 2019-03-21 10:17:44 UTC
Created attachment 143745
Basic-rte logs

Hello,
we've executed subtest basic-rte 1000 times on ICL-U. The issue does not reproduce.
Comment 10 Vanshidhar Konda 2019-03-21 17:07:06 UTC
(In reply to Ida from comment #9)
> Created attachment 143745 [details]
> Basic-rte logs
> 
> Hello,
> we've executed subtest basic-rte 1000 times on ICL-U. The issue does not
> reproduce.

Hello, the issue is only seen on Braswell and Baytrail systems - mostly Chromebooks.
Comment 11 Chris Wilson 2019-03-30 20:55:54 UTC
Looks like either -rc1 or Petri's static analysis cleanup of lib/igt_pm.c made this disappear from BAT. At least, I can't see any residual wakeref caused by snd_hda over the last couple of weeks.