Bug 111214

Summary: [CI][BAT] igt@i915_module_load@reload - timeout - Unable to sync register 0x2f8100
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Cezary Rojewski <cezary.j.rojewski>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: intel-gfx-bugs, kai.vehmanen, stanislav.lisovskiy
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ICL i915 features: display/audio

Description Martin Peres 2019-07-25 08:29:22 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6546/fi-icl-u2/igt@i915_module_load@reload.html

<6> [385.496790] Console: switching to colour dummy device 80x25
<6> [385.496825] [IGT] i915_module_load: executing
<6> [385.499252] [IGT] i915_module_load: starting subtest reload
<4> [388.633911] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x202f8100
<4> [389.635866] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x202f8100
<3> [581.484906] snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register 0x2f8100. -11
<7> [581.485225] [drm:i915_audio_component_get_eld [i915]] Not valid for port B
<7> [581.485342] [drm:i915_audio_component_get_eld [i915]] Not valid for port B
<7> [581.485438] [drm:i915_audio_component_get_eld [i915]] Not valid for port B
<7> [581.485528] [drm:i915_audio_component_get_eld [i915]] Not valid for port C
<7> [581.485615] [drm:i915_audio_component_get_eld [i915]] Not valid for port C
<7> [581.485703] [drm:i915_audio_component_get_eld [i915]] Not valid for port C
<7> [581.485900] [drm:i915_audio_component_get_eld [i915]] Not valid for port D
<7> [581.486017] [drm:i915_audio_component_get_eld [i915]] Not valid for port D
<7> [581.486140] [drm:i915_audio_component_get_eld [i915]] Not valid for port E
<7> [581.486297] [drm:i915_audio_component_get_eld [i915]] Not valid for port E
<7> [581.486403] [drm:i915_audio_component_get_eld [i915]] Not valid for port E
<7> [581.486510] [drm:i915_audio_component_get_eld [i915]] Not valid for port F
<7> [581.486615] [drm:i915_audio_component_get_eld [i915]] Not valid for port F
<7> [581.486723] [drm:i915_audio_component_get_eld [i915]] Not valid for port F
Comment 2 CI Bug Log 2019-07-26 06:36:39 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_module_load@reload - timeout - Unable to sync register 0x2f8100 -}
{+ ICL: igt@i915_module_load@(reload|module-reload) - timeout - Unable to sync register 0x2f8100 +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6550/fi-icl-y/igt@i915_pm_rpm@module-reload.html
Comment 3 Lakshmi 2019-07-26 08:09:25 UTC
Filip, this bug needs assessment. Can you please add your comments and set appropriate severity and priority?
Comment 4 Cezary Rojewski 2019-07-30 09:10:27 UTC
Issue seems to be connected with SGPC - Shared Graphics Power Component. That is, during power state change on gfx side, codec -or clock (not entirely sure which one) may be turned off without notifying audio side about such change.

Now, from audio side codec is still present on the link and thus counted as active. And thus verbs are still being sent to it - timing out in consequence.

Problem seems to exist since SKL and isn't strictly tied to ICL. The only reason it is visible now, is probably of yet another issue present for ICL platforms. We need to fix the root cause though. GFX guys must be engaged and comment on this.
Comment 5 Stanislav Lisovskiy 2019-08-05 07:17:30 UTC
The issue is indeed related to power management and suspend/resume logic. 

I could reproduce it with suspending/resuming the machine, azx_response_timeout can be reproduced easily by enabling codec while disabling correspondent power domain. I also discovered that at least in Ubuntu if I have DP audio device available before suspend, then after suspend it disappears. 

There seems to be some issue/mismatch in intel_audio_codec_enable/intel_audio_codec_disable calls. I'm now also investigating it - to check which issues we might need to fix from GFX side.
Comment 6 Stanislav Lisovskiy 2019-08-07 13:44:07 UTC
Looks like that power well might be not an issue, as power wells 2,3 were enabled just before the timeout. 

However the reason seems to be in a codec itself, at least according to that mailing thread:

https://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg198762.html
Comment 7 Chris Wilson 2019-08-09 07:12:16 UTC
Note that although BAT is not reporting the timeout anymore, it is still present as icl-u2: 198.37 igt@i915_module_load@reload dmesg-warn
Comment 8 Stanislav Lisovskiy 2019-08-19 13:35:03 UTC
(In reply to Chris Wilson from comment #7)
> Note that although BAT is not reporting the timeout anymore, it is still
> present as icl-u2: 198.37 igt@i915_module_load@reload dmesg-warn

Do I understand correctly that this is supposed to be fixed by Takashi Iwai?

Patches: https://patchwork.freedesktop.org/patch/319867/
Comment 9 Lakshmi 2019-08-20 12:53:13 UTC
Comments from Stan: This issue can lead to unpredictable timeouts when verbs are sent to code and it doesn't break the system, not even audio I guess. Most likely this is going to be fixed by this: https://patchwork.freedesktop.org/patch/319867/

Updating the priority and severity.
Comment 10 Stanislav Lisovskiy 2019-08-26 11:53:41 UTC
Based on CI results, last seen one month ago. Probably above mentioned issue in the codec was fixed. Resolving as fixed.
Comment 11 Chris Wilson 2019-08-26 11:55:27 UTC
Module reload still takes over 190s. It is not fixed.
Comment 12 Stanislav Lisovskiy 2019-08-26 12:04:19 UTC
(In reply to Chris Wilson from comment #11)
> Module reload still takes over 190s. It is not fixed.

I thought that was about audio issue, i.e "snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x202f8100
<4> [389.635866] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x202f8100
<3> [581.484906] snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register 0x2f8100. -11"

CI bug log filter doesn't catch that issue anymore - http://gfx-ci.fi.intel.com/cibuglog-ng/issue/1623

So we need to either edit filter or create another bug.
Comment 13 Lakshmi 2019-08-26 12:53:31 UTC
(In reply to Stanislav Lisovskiy from comment #12)
> (In reply to Chris Wilson from comment #11)
> > Module reload still takes over 190s. It is not fixed.
> 
> I thought that was about audio issue, i.e "snd_hda_intel 0000:00:1f.3:
> azx_get_response timeout, switching to polling mode: last cmd=0x202f8100
> <4> [389.635866] snd_hda_intel 0000:00:1f.3: No response from codec,
> disabling MSI: last cmd=0x202f8100
> <3> [581.484906] snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register
> 0x2f8100. -11"
> 
> CI bug log filter doesn't catch that issue anymore -
> http://gfx-ci.fi.intel.com/cibuglog-ng/issue/1623
> 
> So we need to either edit filter or create another bug.
Stan, that's part of Bug 110595.
Comment 14 Stanislav Lisovskiy 2019-08-26 13:10:44 UTC
(In reply to Lakshmi from comment #13)
> (In reply to Stanislav Lisovskiy from comment #12)
> > (In reply to Chris Wilson from comment #11)
> > > Module reload still takes over 190s. It is not fixed.
> > 
> > I thought that was about audio issue, i.e "snd_hda_intel 0000:00:1f.3:
> > azx_get_response timeout, switching to polling mode: last cmd=0x202f8100
> > <4> [389.635866] snd_hda_intel 0000:00:1f.3: No response from codec,
> > disabling MSI: last cmd=0x202f8100
> > <3> [581.484906] snd_hda_codec_hdmi hdaudioC0D2: Unable to sync register
> > 0x2f8100. -11"
> > 
> > CI bug log filter doesn't catch that issue anymore -
> > http://gfx-ci.fi.intel.com/cibuglog-ng/issue/1623
> > 
> > So we need to either edit filter or create another bug.
> Stan, that's part of Bug 110595.

Despite it is also about module reload, I don't think those have anything in common except that issue happens also in module_reload. 110595 is about dmesg warn which we get because Port D was incorrectly marked as non-legacy TypeC port. Which I believe is not related anyhow to audio codec issue, we are dealing with here. Also I don't think it affects module_load@reload execution time, as it is just a dmesg-warn.

I guess we really need to avoid this kind generic bugs to which we associate all kinds of bugs/issues which happen during this particular test case.  

Initially this bug was about azx_response_timeout message we get during module_load, which is related to audio codec.
Comment 15 CI Bug Log 2019-08-27 12:19:43 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_module_load@reload - dseg-warn - azx_get_response timeout, switching to polling mode: last cmd=.*
  (No new failures associated)
Comment 16 Stanislav Lisovskiy 2019-08-27 12:42:06 UTC
(In reply to CI Bug Log from comment #15)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * ICL: igt@i915_module_load@reload - dseg-warn - azx_get_response timeout,
> switching to polling mode: last cmd=.*
>   (No new failures associated)

Lakshmi, there is a typo - "dseg-warn"
Comment 17 CI Bug Log 2019-08-28 15:40:21 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_module_load@reload - dseg-warn - azx_get_response timeout, switching to polling mode: last cmd=.* -}
{+ ICL: igt@i915_module_load@reload - dmesg-warn - azx_get_response timeout, switching to polling mode: last cmd=.* +}


  No new failures caught with the new filter
Comment 18 Kai Vehmanen 2019-09-19 06:27:03 UTC
(In reply to Cezary Rojewski from comment #4)
> Problem seems to exist since SKL and isn't strictly tied to ICL. The only
> reason it is visible now, is probably of yet another issue present for ICL
> platforms. We need to fix the root cause though. GFX guys must be engaged
> and comment on this.

This seems to be the case. I'm debugging this with SOF audio driver and it seems to match this description:

"[ICL] occasional display audio codec probe failure - fail to get afg sub nodes"
https://github.com/thesofproject/linux/issues/1184

The reproduction rate goes up when an arbitrary delay is added between display probe and i915 audio component setup. Bug is fixed if i915 power domains are kept up during the whole sequence. Latest analysis in comment:
https://github.com/thesofproject/linux/issues/1184#issuecomment-532765982
Comment 19 Kai Vehmanen 2019-10-01 16:48:19 UTC
Potential bugfix to this issue submitted on intel-gfx list:
https://lists.freedesktop.org/archives/intel-gfx/2019-October/214621.html
Comment 20 Kai Vehmanen 2019-10-02 08:29:37 UTC
OK, some encouraging results from CI (if I'm interpreting the CI results correctly):

https://patchwork.freedesktop.org/series/67460/
...
  * igt@i915_module_load@reload:
    - fi-icl-u3:          [DMESG-WARN][7] ([fdo#107724] / [fdo#111214]) -> [PASS][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6986/fi-icl-u3/igt@i915_module_load@reload.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_14617/fi-icl-u3/igt@i915_module_load@reload.html

... based on the logs, it seem HDMI audio probe fails in baseline, but now succeeded with the patch!
Comment 21 Chris Wilson 2019-10-04 20:11:46 UTC
commit 1580d3cdddbba4a5ef78a04a5289e32844e6af24
Author: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Date:   Thu Oct 3 11:55:30 2019 +0300

    drm/i915: Fix audio power up sequence for gen10+ display
    
    On platfroms with gen10+ display, driver must set the enable bit of
    AUDIO_PIN_BUF_CTL register before transactions with the HDA controller
    can proceed. Add setting this bit to the audio power up sequence.
    
    Failing to do this resulted in errors during display audio codec probe,
    and failures during resume from suspend.
    
    Note: We may also need to disable the bit afterwards, but there are
    still unresolved issues with that.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111214
    Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191003085531.30990-1-kai.vehmanen@linux.intel.com

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.