Bug 108511

Summary: [CI][BAT] igt@pm_rpm@module-reload - dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/Intel Assignee: Anshuman Gupta <anshuman.gupta>
Status: RESOLVED MOVED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: anshuman.gupta, intel-gfx-bugs, james.ausmus
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: SKL i915 features: display/LSPCON
Attachments:
Description Flags
attachment-14057-0.html none

Description Martin Peres 2018-10-22 07:40:33 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4685/fi-glk-j4005/igt@pm_rpm@module-reload.html

Starting subtest: module-reload
(pm_rpm:3294) CRITICAL: Test assertion failure function assert_drm_connectors_equal, file ../tests/pm_rpm.c:510:
(pm_rpm:3294) CRITICAL: Failed assertion: c1->mmWidth == c2->mmWidth
(pm_rpm:3294) CRITICAL: error: 620 != 0
Subtest module-reload failed.
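
For readers unfamiliar with the check, the assertion above compares connector state captured at two points of the test. The following is a minimal Python sketch of that kind of comparison. The `Connector` class and `assert_connectors_equal` helper are hypothetical stand-ins, not the IGT code in tests/pm_rpm.c; only the `mmWidth` field name and the 620 / 0 values come from the log, and every other value is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Connector:
    """Hypothetical stand-in for the connector fields being compared."""
    connector_id: int
    connected: bool
    mmWidth: int   # physical width in millimetres
    mmHeight: int  # physical height in millimetres


def assert_connectors_equal(c1: Connector, c2: Connector) -> None:
    """Raise AssertionError, in the style of the log, if the widths differ."""
    if c1.mmWidth != c2.mmWidth:
        raise AssertionError(
            "Failed assertion: c1->mmWidth == c2->mmWidth\n"
            f"error: {c1.mmWidth} != {c2.mmWidth}"
        )


# Widths mirror the "620 != 0" line; the id and heights are illustrative.
before = Connector(connector_id=1, connected=True, mmWidth=620, mmHeight=340)
after = Connector(connector_id=1, connected=False, mmWidth=0, mmHeight=0)

try:
    assert_connectors_equal(before, after)
except AssertionError as exc:
    print(exc)
```

A width of 0 on the second snapshot is what one would expect if the connector no longer reports a display at that point, which is one way such a mismatch can arise.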
Comment 1 James Ausmus 2018-10-24 16:48:13 UTC
Is this perhaps related to the "[drm:drm_mode_config_cleanup] *ERROR* connector HDMI-A-1 leaked!" earlier in the dmesg?
Comment 2 Anshuman Gupta 2018-11-15 12:05:17 UTC
It is not reproducible anymore. I have tested it on drm-tip 4.19.0-rc8 and on the latest drm internal tree.

sudo ./tests/pm_rpm --run-subtest module-reload
IGT-Version: 1.23-gea37f9c (x86_64) (Linux: 4.19.0-rc8-custom x86_64)
Runtime PM support: 1
PC8 residency support: 0
DMC: fw loaded: yes
Starting subtest: module-reload
Reloading i915 with disable_display=1 mmio_debug=-1

Runtime PM support: 1
PC8 residency support: 0
DMC: fw loaded: yes
Reloading i915 with mmio_debug=-1

Runtime PM support: 1
PC8 residency support: 0

PS: When tested on the latest drm-tip, the test gets killed by an "unable to handle kernel paging request" error in the kernel, which doesn't look like the same issue.

[  246.498589] BUG: unable to handle kernel paging request at ffffffffc03802f8
[  246.498594] PGD 17440c067 P4D 17440c067 PUD 17440e067 PMD 177950067 PTE 0
[  246.498599] Oops: 0000 [#1] SMP PTI
[  246.498603] CPU: 1 PID: 2429 Comm: pm_rpm Tainted: G  R               4.20.0-rc1+ #1
Comment 3 Martin Peres 2018-11-15 15:38:10 UTC
(In reply to Anshuman Gupta from comment #2)
> It is not reproducible anymore, i have tested it on drm-tip 4.19.0-rc8 and
> on drm internal latest.
> 
> sudo ./tests/pm_rpm --run-subtest module-reload
> IGT-Version: 1.23-gea37f9c (x86_64) (Linux: 4.19.0-rc8-custom x86_64)
> Runtime PM support: 1
> PC8 residency support: 0
> DMC: fw loaded: yes
> Starting subtest: module-reload
> Reloading i915 with disable_display=1 mmio_debug=-1
> 
> Runtime PM support: 1
> PC8 residency support: 0
> DMC: fw loaded: yes
> Reloading i915 with mmio_debug=-1
> 
> Runtime PM support: 1
> PC8 residency support: 0
> 
> PS: When tested on latest drm-tip, test is getting killed due to "unable to
> handle kernel paging request" at kernel, which doesn't look like a similar
> issue.
> 
> [  246.498589] BUG: unable to handle kernel paging request at
> ffffffffc03802f8
> [  246.498594] PGD 17440c067 P4D 17440c067 PUD 17440e067 PMD 177950067 PTE 0
> [  246.498599] Oops: 0000 [#1] SMP PTI
> [  246.498603] CPU: 1 PID: 2429 Comm: pm_rpm Tainted: G  R              
> 4.20.0-rc1+ #1

The reproduction rate has been low, to say the least. So don't assume this is fixed :s

I would say there are easier bugs to get started with: https://intel-gfx-ci.01.org/cibuglog/ <-- order by reproduction rate and pick one you like!
Comment 4 Anshuman Gupta 2018-11-28 13:34:41 UTC
(In reply to Martin Peres from comment #3)
> (In reply to Anshuman Gupta from comment #2)
> > It is not reproducible anymore, i have tested it on drm-tip 4.19.0-rc8 and
> > on drm internal latest.
> > 
> > sudo ./tests/pm_rpm --run-subtest module-reload
> > IGT-Version: 1.23-gea37f9c (x86_64) (Linux: 4.19.0-rc8-custom x86_64)
> > Runtime PM support: 1
> > PC8 residency support: 0
> > DMC: fw loaded: yes
> > Starting subtest: module-reload
> > Reloading i915 with disable_display=1 mmio_debug=-1
> > 
> > Runtime PM support: 1
> > PC8 residency support: 0
> > DMC: fw loaded: yes
> > Reloading i915 with mmio_debug=-1
> > 
> > Runtime PM support: 1
> > PC8 residency support: 0
> > 
> > PS: When tested on latest drm-tip, test is getting killed due to "unable to
> > handle kernel paging request" at kernel, which doesn't look like a similar
> > issue.
> > 
> > [  246.498589] BUG: unable to handle kernel paging request at
> > ffffffffc03802f8
> > [  246.498594] PGD 17440c067 P4D 17440c067 PUD 17440e067 PMD 177950067 PTE 0
> > [  246.498599] Oops: 0000 [#1] SMP PTI
> > [  246.498603] CPU: 1 PID: 2429 Comm: pm_rpm Tainted: G  R              
> > 4.20.0-rc1+ #1
> 
> The reproduction rate has been low, to say the least. So don't assume this
> is fixed :s
> 
> I would say there are easier bugs to get started with:
> https://intel-gfx-ci.01.org/cibuglog/ <-- order by reproduction rate and
> pick one you like!

I have tried to reproduce the issue, but had no luck.

However, I analyzed the logs and was able to reach a conclusion from them.

After resume there is a hotplug event from HDMI. The driver reads an invalid EDID, retries the EDID read after enabling GPIO bit-banging, and fails again, so the HDMI connector status changes from connected to disconnected.

<7> [343.557663] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2]
<7> [343.557756] [drm:intel_hdmi_detect [i915]] [CONNECTOR:125:HDMI-A-2]
<7> [343.569930] [drm:intel_runtime_resume [i915]] Resuming device
--------------------------------------------------------------------------
<4> [343.632835] i915 0000:00:02.0: HDMI-A-2: EDID is invalid:
<4> [343.632841] \x09[00] BAD  00 ff ff ff ff ff ff 00 05 e3 79 28 2b 09 00 00
<4> [343.632844] \x09[00] BAD  32 1b 01 03 80 3e 22 78 2a 08 a5 a2 57 4f a2 28
<4> [343.632847] \x09[00] BAD  0f 50 54 bf ef 00 d1 c0 b3 00 95 00 81 80 80 40
<4> [343.632849] \x09[00] BAD  81 c0 01 01 01 01 4d d0 00 a0 f0 70 3e 80 30 20
<4> [343.632852] \x09[00] BAD  20 00 2d 55 21 00 00 1a a3 66 00 a0 f0 70 1f 80
<4> [343.632854] \x09[00] BAD  30 20 35 00 6d 55 21 00 00 1a 00 00 00 fc 00 55
<4> [343.632857] \x09[00] BAD  32 38 37 39 47 36 0a 20 20 20 20 20 00 00 00 fd
<4> [343.632859] \x09[00] BAD  00 17 50 1e 8c 3c 00 0a 20 20 20 20 20 20 01 88
<7> [343.632933] [drm:intel_hdmi_set_edid [i915]] HDMI GMBUS EDID read failed, retry using GPIO bit-banging
<7> [343.632989] [drm:intel_gmbus_force_bit [i915]] enabling bit-banging on i915 gmbus dpb. force bit now 1
<7> [343.704493] [drm:drm_do_probe_ddc_edid] drm: skipping non-existent adapter i915 gmbus dpb
<7> [343.704595] [drm:intel_gmbus_force_bit [i915]] disabling bit-banging on i915 gmbus dpb. force bit now 0
<7> [343.705207] [drm:do_gmbus_xfer [i915]] GMBUS [i915 gmbus dpb] NAK for addr: 0040 w(1)
<7> [343.705259] [drm:do_gmbus_xfer [i915]] GMBUS [i915 gmbus dpb] NAK on first message, retry
<7> [343.705817] [drm:do_gmbus_xfer [i915]] GMBUS [i915 gmbus dpb] NAK for addr: 0040 w(1)
--------------------------------------------------------------------------------
<7> [343.710557] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2] status updated from connected to disconnected
<7> [343.710563] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2] disconnected
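
The status transition in a log like the one above can be picked out mechanically. This is a minimal Python sketch, assuming dmesg lines in exactly the format shown here; the `status_changes` helper and its regular expression are illustrative and not part of any CI tooling.

```python
import re

# Matches the "status updated from X to Y" lines printed by
# drm_helper_probe_single_connector_modes, as they appear in this log.
STATUS_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:drm_helper_probe_single_connector_modes\]\s+"
    r"\[CONNECTOR:(?P<id>\d+):(?P<name>[^\]]+)\]\s+"
    r"status updated from (?P<old>\w+) to (?P<new>\w+)"
)


def status_changes(dmesg_text):
    """Return (timestamp, connector name, old status, new status) tuples."""
    changes = []
    for line in dmesg_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            changes.append(
                (float(match["ts"]), match["name"], match["old"], match["new"])
            )
    return changes


SAMPLE = """\
<7> [343.557663] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2]
<7> [343.710557] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2] status updated from connected to disconnected
<7> [343.710563] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:125:HDMI-A-2] disconnected
"""

print(status_changes(SAMPLE))
```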


The HDMI EDID read may fail during resume because of a poor device or a poor HDMI cable, which caused the mismatch between the connector width before suspend and during suspend.

I had a discussion with Shashank: this is really specific to the HDMI DRM policy of continuing with the modeset despite an invalid EDID.
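
The EDID dump from the log can also be checked offline. The sketch below is a minimal Python illustration, assuming the 128 bytes are exactly those printed in the "EDID is invalid" lines; the helper names are made up for this example. It relies on three properties of the EDID base block: the fixed 8-byte header, the rule that all 128 bytes sum to 0 modulo 256, and the screen size in centimetres stored at byte offsets 21 and 22.

```python
# The 128 bytes of block 0, copied from the "EDID is invalid" dump above.
EDID_HEX = """
00 ff ff ff ff ff ff 00 05 e3 79 28 2b 09 00 00
32 1b 01 03 80 3e 22 78 2a 08 a5 a2 57 4f a2 28
0f 50 54 bf ef 00 d1 c0 b3 00 95 00 81 80 80 40
81 c0 01 01 01 01 4d d0 00 a0 f0 70 3e 80 30 20
20 00 2d 55 21 00 00 1a a3 66 00 a0 f0 70 1f 80
30 20 35 00 6d 55 21 00 00 1a 00 00 00 fc 00 55
32 38 37 39 47 36 0a 20 20 20 20 20 00 00 00 fd
00 17 50 1e 8c 3c 00 0a 20 20 20 20 20 20 01 88
"""

EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])

block = bytes.fromhex("".join(EDID_HEX.split()))


def header_ok(edid_block: bytes) -> bool:
    """Check the fixed 8-byte EDID header pattern."""
    return edid_block[:8] == EDID_HEADER


def checksum_ok(edid_block: bytes) -> bool:
    """A valid 128-byte block sums to 0 modulo 256."""
    return len(edid_block) == 128 and sum(edid_block) % 256 == 0


def physical_size_mm(edid_block: bytes) -> tuple:
    """Bytes 21 and 22 hold the screen width and height in centimetres."""
    return edid_block[21] * 10, edid_block[22] * 10


print("header ok:  ", header_ok(block))
print("checksum ok:", checksum_ok(block))
print("size (mm):  ", physical_size_mm(block))
```

The width field decodes to 62 cm, i.e. 620 mm, which appears consistent with the `620 != 0` reported by the failed assertion: a connector with this display attached reports 620, and a connector treated as disconnected reports 0.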
Comment 5 Anshuman Gupta 2018-11-28 13:35:11 UTC
Comment 6 Jani Saarinen 2018-11-28 15:40:09 UTC
Is this real display or dummy on GLK? 
https://intel-gfx-ci.01.org/hardware/fi-glk-j4005/i915_display_info.txt
Comment 7 Martin Peres 2018-11-28 15:52:03 UTC
(In reply to Jani Saarinen from comment #6)
> Is this real display or dummy on GLK? 
> https://intel-gfx-ci.01.org/hardware/fi-glk-j4005/i915_display_info.txt

Hard to tell; https://intel-gfx-ci.01.org/hardware.html#fi-glk-j4005 does not say, so I guess it means we just can't get the EDID(?).
Comment 8 CI Bug Log 2019-01-24 12:52:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- GLK: igt@pm_rpm@module-reload - dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth -}
{+ SKL GLK: igt@pm_rpm@module-reload - dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth +}

 No new failures caught with the new filter
Comment 9 Anshuman Gupta 2019-01-24 12:52:34 UTC
Created attachment 143223 [details]
attachment-14057-0.html

Comment 10 CI Bug Log 2019-01-25 15:00:29 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL GLK: igt@pm_rpm@module-reload - dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth -}
{+ SKL GLK: igt@pm_rpm@module-reload - fail / dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5478/fi-skl-6770hq/igt@pm_rpm@module-reload.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5479/fi-skl-6770hq/igt@pm_rpm@module-reload.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5483/fi-skl-6770hq/igt@pm_rpm@module-reload.html
Comment 11 CI Bug Log 2019-02-06 16:24:32 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL GLK: igt@pm_rpm@module-reload - fail / dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth -}
{+ SKL GLK: igt@pm_rpm@(module-reload|drm-resources-equal) - fail / dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_206/fi-skl-6770hq/igt@pm_rpm@drm-resources-equal.html
* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_208/fi-skl-6770hq/igt@pm_rpm@drm-resources-equal.html
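
The updated filter now covers two subtests. As a rough illustration only, and assuming the test-name part of the filter behaves like an ordinary regular expression (the actual CI Bug Log matching rules are not shown in this thread), the effect can be sketched in Python as follows. The third test name is just an example of a non-matching entry.

```python
import re

# Test-name pattern copied from the updated filter above.
# Assumption: it is interpreted as a regular expression over the full name.
TEST_FILTER = re.compile(r"igt@pm_rpm@(module-reload|drm-resources-equal)")


def filter_matches(test_name: str) -> bool:
    """Return True if the whole test name matches the filter pattern."""
    return TEST_FILTER.fullmatch(test_name) is not None


for name in (
    "igt@pm_rpm@module-reload",
    "igt@pm_rpm@drm-resources-equal",
    "igt@pm_rpm@basic-rte",  # example of a name the filter should skip
):
    print(name, filter_matches(name))
```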
Comment 12 Anshuman Gupta 2019-02-11 09:41:02 UTC
(In reply to CI Bug Log from comment #11)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- SKL GLK: igt@pm_rpm@module-reload - fail / dmesg-fail - Failed assertion:
> c1->mmWidth == c2->mmWidth -}
> {+ SKL GLK: igt@pm_rpm@(module-reload|drm-resources-equal) - fail /
> dmesg-fail - Failed assertion: c1->mmWidth == c2->mmWidth +}
> 
> New failures caught by the filter:
> 
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_206/fi-skl-6770hq/
> igt@pm_rpm@drm-resources-equal.html
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_208/fi-skl-6770hq/
> igt@pm_rpm@drm-resources-equal.html

The fi-skl-6770hq setup (https://intel-gfx-ci.01.org/hardware.html) is listed as having only one DP display connected, but after analyzing the logs it seems that downstream ports are present. Can somebody confirm whether this setup uses any downstream ports?
I ask because this bug has recently been reported multiple times on this same setup.
Comment 13 Tomi Sarvela 2019-02-12 07:24:11 UTC
I'm sorry, but can you explain to me what a 'Downstream port' is in this context?
Comment 14 Lakshmi 2019-02-28 15:55:57 UTC
Anshuman is unable to reproduce this issue locally. Further investigation is needed. Priority is dropped to high.
Comment 15 Anshuman Gupta 2019-04-01 08:27:23 UTC
I am not able to reproduce this issue locally, even after running this IGT test continuously for three days.

This bug has been seen on the fi-skl-6770hq CI machine. Can we get access to this particular machine?

@Lakshmi, can we get access to this machine?
Comment 16 Don Hiatt 2019-06-17 21:27:53 UTC
This bug hasn't been seen in months and should be closed unless there are any objections.

Anshuman Gupta : Do you have any objection to closing this bug or do you still require more information?
Comment 17 Anshuman Gupta 2019-06-18 05:44:23 UTC
Yes, we can close this, as I was unable to reproduce this issue.
Comment 18 Lakshmi 2019-07-18 07:25:59 UTC
Reopening: this issue is still happening very regularly, and the history can be found in the CI bug log.
Comment 19 Anshuman Gupta 2019-07-22 09:04:24 UTC
This issue has only been seen on the CI H/W fi-skl-6770hq, which uses LSPCON.
https://intel-gfx-ci.01.org/hardware/fi-skl-6770hq/i915_display_info.txt says that for DP "branch device present: yes" and that the device type is HDMI.
The probable root cause is that the DP-1 connector is disconnected when the IGT module-reload test probes the DP-1 connector modes, which results in a mismatch of the DP-1 connector's physical width.
<7> [275.707285] [drm:intel_runtime_suspend [i915]] Suspending device
<7> [275.710274] [drm:intel_runtime_suspend [i915]] Device suspended
<7> [275.795462] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:86:DP-1] 
-------------------------------------------------------------------------------------------------------------------------------
<7> [276.074575] [drm:lspcon_wait_mode [i915]] Current LSPCON mode PCON 
<7> [276.075459] [drm:intel_dp_read_dpcd [i915]] DPCD: 12 14 c4 01 01 15 01 81 02 01 04 01 0f 00 01 
<7> [276.076282] [drm:drm_dp_read_desc] DP branch: OUI 00-1c-f8 dev-ID 175IB0 HW-rev 1.0 SW-rev 7.64 quirks 0x0000 
<7> [276.077021] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:86:DP-1] status updated from connected to disconnected 
<7> [276.077024] [drm:drm_helper_probe_single_connector_modes] [CONNECTOR:86:DP-1] disconnected 
<7> [276.077204] [drm:intel_dp_detect [i915]] [CONNECTOR:86:DP-1]
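
On the earlier question of what a downstream port is: a DP branch device such as an LSPCON sits between the source and the actual display and advertises this in its DPCD receiver capability bytes. The following minimal Python sketch decodes the 15 DPCD bytes printed by intel_dp_read_dpcd above. The `decode_receiver_caps` helper is illustrative, and the bit layout used for offset 0x005 (bit 0 = downstream port present, bits 2:1 = port type, bit 3 = format conversion, bit 4 = detailed capability info) follows the standard DPCD field definitions as I understand them.

```python
# DPCD receiver capability bytes 0x000..0x00e, copied from the log above.
DPCD_HEX = "12 14 c4 01 01 15 01 81 02 01 04 01 0f 00 01"
dpcd = bytes.fromhex(DPCD_HEX.replace(" ", ""))

# Downstream port type encoding in bits 2:1 of DPCD offset 0x005.
DOWNSTREAM_TYPES = {
    0: "DisplayPort",
    1: "Analog VGA",
    2: "DVI/HDMI/DP++",
    3: "Other",
}


def decode_receiver_caps(caps: bytes) -> dict:
    """Decode a few fields of the DPCD receiver capability block."""
    rev = caps[0x000]
    downstream = caps[0x005]
    return {
        "dpcd_rev": f"{rev >> 4}.{rev & 0x0F}",
        "max_link_rate_gbps": caps[0x001] * 0.27,  # units of 0.27 Gbps/lane
        "max_lane_count": caps[0x002] & 0x1F,
        "downstream_port_present": bool(downstream & 0x01),
        "downstream_port_type": DOWNSTREAM_TYPES[(downstream >> 1) & 0x03],
        "format_conversion": bool(downstream & 0x08),
        "detailed_cap_info": bool(downstream & 0x10),
    }


for key, value in decode_receiver_caps(dpcd).items():
    print(f"{key}: {value}")
```

With the bytes from this log, the downstream-port byte 0x15 decodes to "present" with a TMDS-style (DVI/HDMI/DP++) port type, which matches the "branch device present: yes" and HDMI device type reported in i915_display_info.txt.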
Comment 20 Vanshidhar Konda 2019-11-19 19:51:38 UTC
This error has been occurring on the SKL 6770HQ CI machine every 4-5 runs for the past two months. It didn't occur from August 15 to September 17, 2019. Also, the suspected reason for the failure doesn't seem to be occurring on the recent runs: DP-1 connects/reconnects after the device resumes from suspend.
Comment 21 Martin Peres 2019-11-29 17:58:31 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/178.
