Bug 108840 - [CI][BAT] igt@*pm_rpm@* - incomplete
Summary: [CI][BAT] igt@*pm_rpm@* - incomplete
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest normal
Assignee: Anshuman Gupta
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 108768 108864 108990
Depends on:
Blocks: 109507
Reported: 2018-11-22 16:04 UTC by Lakshmi
Modified: 2019-04-11 11:42 UTC (History)

See Also:
i915 platform: ICL
i915 features: GEM/Other, power/runtime PM


Attachments
DMESG logs (6.10 MB, text/plain)
2019-04-04 13:46 UTC, Anshuman Gupta

Comment 1 Chris Wilson 2018-11-22 21:39:38 UTC
acpi idle gpf?
Comment 2 Chris Wilson 2018-11-22 21:41:44 UTC
*** Bug 108768 has been marked as a duplicate of this bug. ***
Comment 3 Martin Peres 2018-11-26 11:39:07 UTC
*** Bug 108864 has been marked as a duplicate of this bug. ***
Comment 4 Lakshmi 2018-12-03 12:48:35 UTC
This issue was last seen in CI_DRM_5236_full (1 day, 22 hours / 5 runs ago).
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5236/shard-iclb6/igt@pm_rpm@cursor-dpms.html
Comment 6 Anshuman Gupta 2018-12-18 11:44:00 UTC
(In reply to Lakshmi from comment #5)
> One more instance
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/
> igt@pm_rpm@debugfs-read.html

I am able to reproduce the issue repeatedly, but I cannot capture the page-fault crash in dmesg: my system resets before anything is printed. So I am not sure whether this crash is the same as the page fault in ACPI idle.

I see that ACPI idle GPF crash logs are available for CI_DRM_5180 and CI_DRM_5184.
Where can I get the vmlinux kernel object file and the System.map file for CI_DRM_5180 and CI_DRM_5184? They would help in debugging the page fault in the acpi_idle_enter code.
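
For anyone following along: once a System.map for the failing build is available, mapping a faulting address back to a symbol is a nearest-lower-symbol lookup. A minimal Python sketch, assuming the usual "address type name" System.map line format. The addresses and the two example_* neighbour symbols are made up for illustration and do not come from CI_DRM_5180 or CI_DRM_5184.

```python
import bisect


def load_system_map(lines):
    """Parse System.map-style lines ("<hex addr> <type> <name>") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the closest symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


# Hypothetical addresses and neighbour symbols, for illustration only.
EXAMPLE_MAP = [
    "ffffffff814fff00 t example_prev_symbol",
    "ffffffff81500000 t acpi_idle_enter",
    "ffffffff81500100 T example_next_symbol",
]
```

With this, resolve(load_system_map(EXAMPLE_MAP), 0xffffffff81500069) gives the symbol name plus the byte offset into it. The lookup only covers symbols present in the map, so module addresses would need a different source.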
Comment 7 Anshuman Gupta 2019-01-02 12:23:15 UTC
(In reply to Lakshmi from comment #5)
> One more instance
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/
> igt@pm_rpm@debugfs-read.html

The ICL CPU ID family patches are not yet merged into the upstream mainline kernel, so ICL hardware still uses the ACPI idle driver. This ACPI idle page fault will not occur with the intel_idle driver, so the issue should be fixed once the ICL CPU ID patches are public.
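
To confirm which cpuidle driver a machine is really running, the sysfs file below can be read. A small Python sketch, assuming a kernel that exposes /sys/devices/system/cpu/cpuidle/current_driver; the helper name is my own.

```python
from pathlib import Path

# Standard cpuidle sysfs node; typical values include "acpi_idle" and "intel_idle".
CPUIDLE_DRIVER = Path("/sys/devices/system/cpu/cpuidle/current_driver")


def current_cpuidle_driver(path=CPUIDLE_DRIVER):
    """Return the active cpuidle driver name, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None
```

Calling current_cpuidle_driver() on an affected ICL machine would be expected to return acpi_idle if the explanation above holds.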
Comment 8 CI Bug Log 2019-02-04 16:03:51 UTC
A CI Bug Log filter associated to this bug has been updated:

{- shard-iclb6  fi-icl-u3: igt@pm_rpm@* - incomplete -}
{+ shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5519/shard-iclb1/igt@pm_rpm@modeset-lpsp-stress.html
Comment 9 Anshuman Gupta 2019-02-11 08:27:23 UTC
This crash is different from the earlier ones: those were in the ACPI idle driver, but this one is not. It looks like a different bug to me.
Comment 10 Lakshmi 2019-02-12 07:51:42 UTC
I believe this error is related to this bug.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5590/fi-icl-y/igt@pm_rpm@module-reload.html
Comment 11 Lakshmi 2019-02-12 07:54:10 UTC
(In reply to Anshuman Gupta from comment #9)
> This crash is different from earlier crashes, earlier crashes were in to
> acpi idle driver, but this is different one. Looks a different bug to  me.

Thanks for your comment here. I will create a new bug for this failure if I don't find a suitable existing one.
Comment 12 CI Bug Log 2019-02-12 07:55:19 UTC
A CI Bug Log filter associated to this bug has been updated:

{- shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete -}
{+ fi-icl-y shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5590/fi-icl-y/igt@pm_rpm@module-reload.html
Comment 13 Lakshmi 2019-02-12 07:57:16 UTC
Setting the priority to highest, as the failure is seen in BAT.
Comment 14 Anshuman Gupta 2019-02-13 09:17:48 UTC
(In reply to Lakshmi from comment #13)
> Setting the priority as highest as the failure seen in BAT.

(In reply to CI Bug Log from comment #12)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete -}
> {+ fi-icl-y shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete +}
> 
> New failures caught by the filter:
> 
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5590/fi-icl-y/
> igt@pm_rpm@module-reload.html

No information is available in this log. Given that the result is incomplete, this is either a kernel crash or a hang, but the logs contain no information about a panic or a hang.
Comment 15 CI Bug Log 2019-03-04 07:30:25 UTC
A CI Bug Log filter associated to this bug has been updated:

{- fi-icl-y shard-iclb6  fi-icl-u3 shard-iclb1: igt@pm_rpm@* - incomplete -}
{+ ICL: igt@*pm_rpm@* - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4868/shard-iclb4/igt@i915_pm_rpm@gem-execbuf-stress-extra-wait.html
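
The filter change above widens the test pattern from igt@pm_rpm@* to igt@*pm_rpm@*, which is why the renamed i915_pm_rpm tests are now caught. A rough Python illustration, assuming the pattern behaves like a shell-style glob (the thread does not say what matching syntax CI Bug Log really uses):

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "igt@pm_rpm@*"
NEW_PATTERN = "igt@*pm_rpm@*"

# Test names taken from failures linked in this bug.
TESTS = [
    "igt@pm_rpm@cursor-dpms",
    "igt@pm_rpm@module-reload",
    "igt@i915_pm_rpm@gem-execbuf-stress-extra-wait",
]


def matching(pattern, names):
    """Return the names that match a case-sensitive shell-style glob pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]
```

Under that assumption, the old pattern matches only the first two names, while the new one matches all three.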
Comment 16 Lakshmi 2019-03-22 08:39:18 UTC
Anshuman, have any changes been made recently to address this bug? Last seen in drmtip_244 (5 days, 11 hours / 87 runs ago) on fi-icl-y.
Apart from fi-icl-y, last seen on the shards in IGT_4878_full (1 week, 6 days / 168 runs ago).
Comment 17 Lakshmi 2019-04-03 07:41:29 UTC
Update: Anshuman will try to reproduce this issue with the latest BIOS to get more information about it.
Comment 18 Martin Peres 2019-04-03 07:50:54 UTC
Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or never completes), which triggers a hang, which reboots the machine because CI sets panic=1.

There is an assumption that a page fault happening in the ACPI idle driver might be responsible for this, but I'm CC:ing Francesco to verify that the logic is sound with his engineers, while waiting for the ACPI driver fixes to be ready and landed in Linux.
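
As a quick way to check the panic=1 part on a given machine, the kernel command line can be inspected. A minimal Python sketch, assuming a plain space-separated command line (quoted values containing spaces are not handled); the function names are my own. Per the kernel parameter documentation, a positive panic= value is the number of seconds to wait before rebooting after a panic.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def reboot_delay_after_panic(cmdline):
    """Return the panic= timeout as an int, or None if it is absent or empty."""
    value = parse_cmdline(cmdline).get("panic")
    return int(value) if value else None
```

On a live system the input would be the contents of /proc/cmdline, for example reboot_delay_after_panic(open("/proc/cmdline").read()).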
Comment 19 Martin Peres 2019-04-03 13:10:53 UTC
(In reply to Martin Peres from comment #18)
> Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or
> never completes), which triggers a hang, which reboots the machine because
> CI sets panic=1.
> 
> There is an assumption that a page fault happening in the ACPI idle driver
> might be responsible for this, but I'm CC:ing Francesco to verify that the
> logic is sound with his engineers, while waiting for the ACPI driver fixes
> to be ready and landed in Linux.

Daniel Vetter suggests that we should write a patch for core-for-CI that disables the ACPI driver for ICL only. Who is a taker?
Comment 20 Lakshmi 2019-04-04 07:58:09 UTC
Anshuman, any updates here?
Comment 21 Anshuman Gupta 2019-04-04 08:48:27 UTC
(In reply to Martin Peres from comment #19)
> (In reply to Martin Peres from comment #18)
> > Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or
> > never completes), which triggers a hang, which reboots the machine because
> > CI sets panic=1.
> > 
> > There is an assumption that a page fault happening in the ACPI idle driver
> > might be responsible for this, but I'm CC:ing Francesco to verify that the
> > logic is sound with his engineers, while waiting for the ACPI driver fixes
> > to be ready and landed in Linux.
> 
> Daniel Vetter suggests that we should write a patch for core-for-CI that
> disables the ACPI driver for ICL only. Who is a taker?

Do you mean disabling the ACPI idle driver, or disabling CPU idle completely?
Comment 22 Lakshmi 2019-04-04 12:46:15 UTC
(In reply to Anshuman Gupta from comment #21)
> (In reply to Martin Peres from comment #19)
> > (In reply to Martin Peres from comment #18)
> > > Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or
> > > never completes), which triggers a hang, which reboots the machine because
> > > CI sets panic=1.
> > > 
> > > There is an assumption that a page fault happening in the ACPI idle driver
> > > might be responsible for this, but I'm CC:ing Francesco to verify that the
> > > logic is sound with his engineers, while waiting for the ACPI driver fixes
> > > to be ready and landed in Linux.
> > 
> > Daniel Vetter suggests that we should write a patch for core-for-CI that
> > disables the ACPI driver for ICL only. Who is a taker?
> 
> you mean to disable ACPI idle driver or disable CPU idle completely?

Anshuman, disable the ACPI idle driver.
Comment 23 Anshuman Gupta 2019-04-04 13:42:09 UTC
(In reply to Lakshmi from comment #20)
> Anshuman, any updates here?

We ran the test on our local setup with the latest drm-tip and the latest BIOS, V3121.
i915_pm_rpm doesn't complete: there is a hard lockup after running the IGT subtest "gem-execbuf-stress".
 
[ 1496.982609] e1000e 0000:ad:00.0 enp173s0: Detected Hardware Unit Hang:
                 TDH                  <0>
                 TDT                  <2>
                 next_to_use          <2>
                 next_to_clean        <0>
               buffer_info[next_to_clean]:
                 time_stamp           <100048cbb>
                 next_to_watch        <0>
                 jiffies              <1000490c0>
                 next_to_watch.status <0>
               MAC Status             <383>
               PHY Status             <792d>
               PHY 1000BASE-T Status  <3800>

This hang is believed to have originated from the system-suspend and system-hibernate subtests. I wonder whether it is the same as the CI hang or a different one.
Comment 24 Anshuman Gupta 2019-04-04 13:46:21 UTC
Created attachment 143861
DMESG logs

BA setup logs
Comment 25 Chris Wilson 2019-04-04 14:10:35 UTC
(In reply to Anshuman Gupta from comment #23)
> (In reply to Lakshmi from comment #20)
> > Anshuman, any updates here?
> 
> We ran the test on our local setup with latest drm tip and latest BIOS V3121.
> i915_pm_rpm doesn't complete and there is hard lockup after running igt
> subtest "gem-execbuf-stress" there is hard lock up.
>  
> [ 1496.982609] e1000e 0000:ad:00.0 enp173s0: Detected Hardware Unit Hang:
>                  TDH                  <0>
>                  TDT                  <2>
>                  next_to_use          <2>
>                  next_to_clean        <0>
>                buffer_info[next_to_clean]:
>                  time_stamp           <100048cbb>
>                  next_to_watch        <0>
>                  jiffies              <1000490c0>
>                  next_to_watch.status <0>
>                MAC Status             <383>
>                PHY Status             <792d>
>                PHY 1000BASE-T Status  <3800>

Keep filing those to e1000e maintainers.

> This hang is believed to originated from system-suspend and system-hibernate
> sub test. I wonder if this is similar to CI hang or different hang.

We can't tell. If netconsole is unreliable due to e1000e, use a serial console.
Comment 26 Anshuman Gupta 2019-04-08 10:09:56 UTC
(In reply to Martin Peres from comment #19)
> (In reply to Martin Peres from comment #18)
> > Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or
> > never completes), which triggers a hang, which reboots the machine because
> > CI sets panic=1.
> > 
> > There is an assumption that a page fault happening in the ACPI idle driver
> > might be responsible for this, but I'm CC:ing Francesco to verify that the
> > logic is sound with his engineers, while waiting for the ACPI driver fixes
> > to be ready and landed in Linux.
> 
> Daniel Vetter suggests that we should write a patch for core-for-CI that
> disables the ACPI driver for ICL only. Who is a taker?

The patch is ready. If we disable the ACPI idle driver, there is no idle driver available for ICL, so as a consequence CPU idle is also disabled completely.

As it is a hack, is it OK to send this patch to Intel-gfx <intel-gfx@lists.freedesktop.org>?
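
One way to verify the "CPU idle gets disabled completely" consequence on a test machine is to list the idle states sysfs exposes for a CPU. A Python sketch, assuming the standard cpuidle sysfs layout (cpuN/cpuidle/stateM/name); with no cpuidle driver registered I would expect the list to come back empty, but I have not checked that on ICL with the patch applied.

```python
from pathlib import Path


def idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return idle-state names for one CPU, in directory-name order.

    An empty list means no cpuidle state directories were found.
    """
    names = []
    for state in sorted(Path(cpu_dir).glob("cpuidle/state*")):
        try:
            names.append((state / "name").read_text().strip())
        except OSError:
            continue  # state directory without a readable name file
    return names
```

Running idle_states() before and after applying the patch should show the difference, if the assumption above holds.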
Comment 27 Jani Saarinen 2019-04-08 10:49:25 UTC
BIOS was updated on ICL-U3. 
See: 
<6>[    0.000000] DMI: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3121.A00.1903190527 03/19/2019
Comment 28 Martin Peres 2019-04-08 11:08:19 UTC
(In reply to Anshuman Gupta from comment #26)
> (In reply to Martin Peres from comment #19)
> > (In reply to Martin Peres from comment #18)
> > > Seems like the common denominator is that i915_gem_wait_for_idle() hangs (or
> > > never completes), which triggers a hang, which reboots the machine because
> > > CI sets panic=1.
> > > 
> > > There is an assumption that a page fault happening in the ACPI idle driver
> > > might be responsible for this, but I'm CC:ing Francesco to verify that the
> > > logic is sound with his engineers, while waiting for the ACPI driver fixes
> > > to be ready and landed in Linux.
> > 
> > Daniel Vetter suggests that we should write a patch for core-for-CI that
> > disables the ACPI driver for ICL only. Who is a taker?
> 
> patch is ready, if we disable ACPI idle driver, there is no idle driver
> available for ICL, so as consequence of it cpu idle also gets disabled
> completely. 
> 
> As it is a hack, is it ok to send this patch to Intel-gfx
> <intelgfx-@lists.freedesktop.org>  ?

Yes. Please explain why you are doing this, and that it is for core-for-CI and ICL-only. This will give you some testing, and you can also get comments on it :)
Comment 29 Anshuman Gupta 2019-04-08 13:14:48 UTC
(In reply to Jani Saarinen from comment #27)
> BIOS was updated on ICL-U3. 
> See: 
> <6>[    0.000000] DMI: Intel Corporation Ice Lake Client Platform/IceLake U
> DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3121.A00.1903190527 03/19/2019

The issue is observed even with BIOS 3121 at BA.

[38218.404599] Sending NMI from CPU 0 to CPUs 1-7:
[38218.409133] NMI backtrace for cpu 4 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409191] NMI backtrace for cpu 6 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409192] NMI backtrace for cpu 2 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409210] NMI backtrace for cpu 5 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409211] NMI backtrace for cpu 1 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409253] NMI backtrace for cpu 7 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409254] NMI backtrace for cpu 3 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.410248] Kernel panic - not syncing: hung_task: blocked tasks
[38218.476356] Dumping ftrace buffer:
[38218.479748] ---------------------------------
[38218.484092] CPU:1 [LOST 5491 EVENTS]
[38218.484092] i915_pm_-18748   1d..1 31453892539us : 0xffffffffa0518b5d: rcs0 in[0]:  ctx=64.1, fence 8b9:44 (current 42), prio=6
[38218.499045] kworker/-19088   1.... 31570414263us : 0xffffffffa04f0ab8: rcs0 fence 8b9:80, current 80
[38218.508127] kworker/-19088   1.... 31570414276us : 0xffffffffa04f0d51: __retire_engine_request(rcs0) fence 8b9:80, current 80
[38218.519363] kworker/-19088   1.... 31570414278us : 0xffffffffa04d6095: 
[38218.525945] kworker/-19088   1.... 31570517192us : 0xffffffffa04b6765: awake?=yes
[38218.533388] kworker/-19088   1.... 31570517240us : 0xffffffffa04f3295: rcs0 fence 89c:96
[38218.541433] kworker/-19088   1d..1 31570517275us : 0xffffffffa0517826: rcs0 cs-irq head=1, tail=1
[38218.550259] kworker/-19088   1d..1 31570517277us : 0xffffffffa04f2fa2: rcs0 fence 89c:96 -> current 94
[38218.559512] kworker/-19088   1d..1 31570517294us : 0xffffffffa0518b5d: rcs0 in[0]:  ctx=0.1, fence 89c:96 (current 94), prio=-8186
[38218.571179] kworker/-19088   1.... 31570517300us : 0xffffffffa04da095: flags=12 (locked), timeout=200
[38218.580344] kworker/-19088   1.... 31570519413us : 0xffffffffa04f0ab8: rcs0 fence 89c:96, current 96
[38218.589426] kworker/-19088   1.... 31570519416us : 0xffffffffa04f0d51: __retire_engine_request(rcs0) fence 89c:96, current 96
[38218.600664] kworker/-19088   1.... 31570519445us : 0xffffffffa04d45e5: 
[38218.607245] i915_pm_-18748   1.... 31645033342us : 0xffffffffa04e27e5: 
[38218.613825] i915_pm_-18748   1.... 31645033352us : 0xffffffffa04b6765: awake?=no
[38218.621179] i915_pm_-18748   1.... 31645033354us : 0xffffffffa04da095: flags=12 (locked), timeout=200
[38218.630345] i915_pm_-18748   1.... 31645033358us : 0xffffffffa04da095: flags=12 (locked), timeout=9223372036854775807 (forever)
[38218.641756] i915_pm_-18748   1.... 31645688983us : 0xffffffffa04e2745: 
[38218.648339] i915_pm_-18748   1.... 31645689156us : 0xffffffffa050bc95: 
[38218.654917] i915_pm_-18748   1.... 31645689158us : 0xffffffffa047efeb: engine_mask=ff
[38218.662728] kworker/-19601   1.... 32641702411us : i915_gem_sanitize: 
[38218.669257] kworker/-19601   1.... 32641702533us : intel_engines_sanitize: 
[38218.676205] kworker/-19601   1.... 32641702535us : intel_gpu_reset: engine_mask=ff
[38218.683764] kworker/-19601   1.... 32641703919us : i915_gem_resume: 
[38218.690120] kworker/-19616   1.... 32648456199us : intel_engines_sanitize: 
[38218.697070] kworker/-19616   1.... 32648456562us : intel_gpu_reset: engine_mask=ff
[38218.704630]  rtcwake-19715   1.... 32656291853us : i915_gem_suspend: 
[38218.711065] kworker/-19587   1.... 32657423968us : i915_gem_sanitize: 
[38218.717591] kworker/-19587   1.... 32657424001us : intel_engines_sanitize: 
[38218.724541] kworker/-19587   1.... 32657424004us : intel_gpu_reset: engine_mask=ff
[38218.732098] kworker/-19587   1.... 32657426000us : i915_gem_resume:
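
When reading dumps like the one above, it can help to summarise where each CPU was idling at the time of the NMI. A small Python sketch that parses only the "NMI backtrace for cpu N skipped: idling at ..." lines; the regular expression is written against the line format shown in this log and is not guaranteed to cover other kernel versions.

```python
import re

# Matches e.g. "NMI backtrace for cpu 4 skipped: idling at sym+0x69/0xb0".
NMI_IDLE = re.compile(
    r"NMI backtrace for cpu (\d+) skipped: idling at (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def idle_cpus(dmesg_text):
    """Map CPU number -> symbol it was idling in, from skipped-backtrace lines."""
    return {int(cpu): sym for cpu, sym in NMI_IDLE.findall(dmesg_text)}


# Three lines copied from the log above.
LOG = """\
[38218.409133] NMI backtrace for cpu 4 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.409191] NMI backtrace for cpu 6 skipped: idling at acpi_processor_ffh_cstate_enter+0x69/0xb0
[38218.410248] Kernel panic - not syncing: hung_task: blocked tasks
"""
```

idle_cpus(LOG) returns one entry per skipped CPU. In the full log above, CPUs 1 to 7 all report acpi_processor_ffh_cstate_enter.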
Comment 30 Lakshmi 2019-04-09 07:32:17 UTC
(In reply to Anshuman Gupta from comment #29)
> (In reply to Jani Saarinen from comment #27)
> > BIOS was updated on ICL-U3. 
> > See: 
> > <6>[    0.000000] DMI: Intel Corporation Ice Lake Client Platform/IceLake U
> > DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3121.A00.1903190527 03/19/2019
> 
> Issue is observed even with BIOS 3121 at BA.

@Anshuman/Jani/Francesco, what are the next steps?
Comment 31 Jani Saarinen 2019-04-09 11:23:58 UTC
Get this merged: https://patchwork.freedesktop.org/series/59170/
Comment 32 Lakshmi 2019-04-10 05:40:12 UTC
(In reply to Jani Saarinen from comment #31)
> Get this merged: https://patchwork.freedesktop.org/series/59170/

Review comments from Rafael J. Wysocki:
This is fine only as long as it doesn't get anywhere close to the mainline.

If ACPI idle crashes on new Intel HW, it needs to be fixed to work with 
it instead of refusing to work on it.
Comment 33 Lakshmi 2019-04-10 20:09:39 UTC
How do we proceed with this bug?
Comment 34 Jani Saarinen 2019-04-11 04:01:42 UTC
I would resolve it now, as the patch is there: https://cgit.freedesktop.org/drm-intel/commit/?h=topic/core-for-CI&id=b573fba52f339dc4fadef7282af4a9413fd6173d
But we cannot close this until a real fix lands upstream from the core team.
Comment 35 Martin Peres 2019-04-11 11:42:39 UTC
*** Bug 108990 has been marked as a duplicate of this bug. ***

