Bug 107724

Summary: [CI] [BAT] [DRMTIP] igt@* - dmesg-warn / dmesg-fail - *ERROR* CPU pipe [ABC] FIFO underrun
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Ville Syrjala <ville.syrjala>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: highest
CC: intel-gfx-bugs, james.ausmus, marta.lofstedt, matthew.d.roper, przanoni, ricardo.o.perez, ville.syrjala
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: ICL
i915 features: display/Other
Bug Depends on:
Bug Blocks: 105980

Comment 1 James Ausmus 2018-09-24 23:03:21 UTC
*** Bug 107720 has been marked as a duplicate of this bug. ***
Comment 2 Paulo Zanoni 2018-09-28 17:16:22 UTC
Update: working on it. I have already identified a few problems, and this is not something we're going to solve with a single patch. I'll provide more updates once I have real patches.
Comment 3 Paulo Zanoni 2018-10-04 23:33:06 UTC
Submitted https://patchwork.freedesktop.org/series/50579/ but I'm not sure it will solve the problem.

I was looking at the logs for fi-icl-u and it seems that sometimes during boot the interrupts just act like crazy: either you get a ton of interrupts, or the IIR registers are unclearable and contain crazy values, with even reserved bits set. I simply can't reproduce this type of problem you're seeing. Example:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-icl-u/boot3.log

This could be some memory corruption happening, or it could simply be a bad BIOS.

Another thing that was brought to my attention is that we often get the mysterious crazy interrupts right after enabling DMC. Would it be possible to run a few tests with DMC disabled? The CI pages suggest the problem happens only around 11% of the time, so a few rounds of tests would probably be needed :/
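As a rough illustration of why "a few rounds" may not be enough: assuming each CI run fails independently with the same probability (an assumption, not something the CI data confirms), the number of clean runs needed before a DMC-disabled result means anything can be estimated as below. The function names are hypothetical and exist only for this sketch.

```python
import math


def chance_of_clean_streak(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass by pure luck."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n with 1 - (1 - p)**n >= confidence.

    Assumes independent runs with a constant failure rate p, which is an
    idealisation of real CI behaviour.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    p = 0.11  # failure rate quoted from the CI pages in this comment
    n = runs_needed(p, 0.95)
    print(f"{n} clean runs needed for 95% confidence")
    print(f"chance of {n} clean runs by luck: {chance_of_clean_streak(p, n):.3f}")
```

With the 11% figure, a handful of passing runs with DMC disabled would still be fairly likely even if DMC were unrelated, so the run count matters more than it first appears.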

Perhaps giving me remote access to the machine would also help us move forward a little faster.

Thanks,
Paulo
Comment 4 steven.j.hockemeier 2018-10-09 20:22:52 UTC
I also noticed that this is being classified as Highest, but I thought that classification was reserved for showstoppers. Does an 11% failure rate still fall within that severity? (Just checking.)
Comment 5 Lakshmi 2018-10-10 06:52:19 UTC
This issue is occurring in every round of drm-tip execution.
Comment 6 Paulo Zanoni 2018-10-12 17:09:45 UTC
(In reply to Lakshmi from comment #5)
> This issue is occurring in every round of drm-tip execution.

When it happens, boot.log always shows craziness during machine initialization:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-icl-u/boot8.log

None of the machines we have here can reproduce this problem. I wonder if it's a hardware/bios issue with the specific ICL machine that's in CI.
Comment 7 Ville Syrjala 2018-10-12 20:33:30 UTC
The pipe C IIR noise is a bit odd. Bspec seems to be telling me that these registers live in PG2, whereas the code appears to assume that each register lives in whatever power well its pipe lives in. So either the code is a bit wrong (though it should probably still work just fine) or the spec is wrong. Since all the power wells up to PG4 should be enabled, it shouldn't really matter here either way. I agree with Paulo that DMC might have something to do with this as well.

The WARN_ON(!intel_pstate->base.fb) is also mysterious. We should have either reused the BIOS fb or disabled all the planes, so I can't really see how an enabled plane could get that far without a framebuffer.
Comment 8 Lakshmi 2018-10-15 16:30:03 UTC
(In reply to Paulo Zanoni from comment #6)
> (In reply to Lakshmi from comment #5)
> > This issue is occurring in every round of drm-tip execution.
> 
> When it happens, boot.log always shows craziness during machine
> initialization:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-icl-u/boot8.log
> 
> None of the machines we have here can reproduce this problem. I wonder if
> it's a hardware/bios issue with the specific ICL machine that's in CI.

Paulo, this issue was last seen here:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_127/fi-icl-u/igt@kms_busy@extended-modeset-hang-oldfb-render-a.html

Looks like this is still happening. Do you think this occurred for some other reason?
Comment 9 James Ausmus 2018-10-18 18:28:00 UTC
After discussion with Paulo and JaniS, it appears this problem is specific to this one ICL board, as it's not reproducing on the other ICL in CI, and we can't reproduce on any of our local hardware. It was agreed to swap a different board in for CI. I'm lowering this to Medium, and if the issue doesn't reproduce with the new CI HW, we should close this.
Comment 10 Lakshmi 2018-10-30 14:47:36 UTC
(In reply to James Ausmus from comment #9)
> After discussion with Paulo and JaniS, it appears this problem is specific
> to this one ICL board, as it's not reproducing on the other ICL in CI, and
> we can't reproduce on any of our local hardware. It was agreed to swap a
> different board in for CI. I'm lowering this to Medium, and if the issue
> doesn't reproduce with the new CI HW, we should close this.

This issue appears in fi-icl-U2 as well.
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4700/fi-icl-u2/igt@gem_ctx_create@basic-files.html
Raising the priority as it happens with BAT.
Comment 11 James Ausmus 2018-10-30 15:00:08 UTC
Hmm - this looks suspicious. If you take a look at https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u2.html you'll see that the only two times igt@gem_ctx_create@basic-files *failed*, are the two times that igt@debugfs_test@read_all_entries and igt@gem_exec_suspend@basic-s3 *didn't* fail with powerwell related errors.

There could be a relationship with the powerwell issues.
Comment 13 James Ausmus 2018-11-02 16:01:33 UTC
This hasn't occurred again in BAT since the two tests with the suspicious result pattern. Moving this back to high.
Comment 14 James Ausmus 2018-11-14 21:23:26 UTC
Waiting to move forward with this, on the idea that Ville's watermark series at https://patchwork.freedesktop.org/series/51878/ might help out here.
Comment 15 Jani Saarinen 2018-11-15 15:46:21 UTC
Ville, you have some WM patches in review; should they help here?
In the latest runs this is now getting worse, also on ICL-u2.
Comment 16 Radosław Szwichtenberg 2018-11-16 13:13:02 UTC
Affecting 289 tests on CI.
Comment 17 Jani Saarinen 2018-12-12 07:43:24 UTC
Hi Matt, I saw your comment on https://bugs.freedesktop.org/show_bug.cgi?id=105458 that
"I also notice that CI indicates a bunch of pre-existing ICL watermark failures are no longer happening with my series (or the earlier revisions of my series), so it's possible that we've also fixed
https://bugs.freedesktop.org/show_bug.cgi?id=107724 "by accident" with this series."

Fingers crossed!
Comment 18 Jani Saarinen 2018-12-17 19:29:16 UTC
Nope, it did not fix it. Still WIP.
Comment 19 Jani Saarinen 2019-01-18 08:43:30 UTC
Still WIP; work continues to find the root cause.
Comment 20 Stanislav Lisovskiy 2019-01-24 12:41:31 UTC
*** Bug 108336 has been marked as a duplicate of this bug. ***
Comment 21 CI Bug Log 2019-02-12 11:07:43 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium \(hdmi-cmp-nv12\) +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_217/fi-icl-u2/igt@runner@aborted.html
Comment 22 Petri Latvala 2019-02-25 10:03:39 UTC
*** Bug 105915 has been marked as a duplicate of this bug. ***
Comment 23 Martin Peres 2019-04-02 11:31:05 UTC
Ville has been working on it, and he has patches addressing the issue: https://patchwork.freedesktop.org/series/58299/

They have not landed yet because IGT does not cope well with the additional restrictions on plane size. The IGT test fixes are coming along, and we can expect to close this issue in the coming week or so.

The customer impact of this issue is very high as this could lead to transient black screens / flickering, due to exceeding the memory bandwidth available to the display controller.
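To make the bandwidth argument concrete, here is a deliberately simplified sketch of the kind of check involved: add up the data rate of every enabled plane and compare it against what the memory subsystem can deliver. This is not the i915 implementation; the `Plane` type, the function names, and the per-frame data rate formula are all assumptions for illustration, and the real driver takes more factors into account.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Plane:
    """Hypothetical description of one enabled display plane."""
    width: int            # visible width in pixels
    height: int           # visible height in pixels
    bytes_per_pixel: int  # e.g. 4 for XRGB8888


def plane_data_rate(plane: Plane, refresh_hz: int) -> int:
    """Approximate bytes per second the plane must fetch from memory.

    Simplification: ignores blanking, scaling, and compression.
    """
    return plane.width * plane.height * plane.bytes_per_pixel * refresh_hz


def fits_in_bandwidth(planes: Iterable[Plane], refresh_hz: int,
                      available_bytes_per_s: int) -> bool:
    """True if the combined plane data rate stays within the budget.

    A configuration that fails this check would risk FIFO underruns, so a
    driver performing such a check would reject it up front.
    """
    total = sum(plane_data_rate(p, refresh_hz) for p in planes)
    return total <= available_bytes_per_s


if __name__ == "__main__":
    uhd = Plane(width=3840, height=2160, bytes_per_pixel=4)
    budget = 2_000_000_000  # made-up budget in bytes per second
    print(fits_in_bandwidth([uhd], 60, budget))       # one plane
    print(fits_in_bandwidth([uhd, uhd], 60, budget))  # two planes
```

Rejecting configurations that exceed the budget is also why tests that previously used large planes can start failing once such a check is in place.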

To fix these issues for good, a test is being developed by Stan to stress-test watermark selection, which will hopefully allow us to iron out the final kinks and make FIFO underruns a thing of the past.
Comment 24 Martin Peres 2019-04-04 12:09:31 UTC
*** Bug 107720 has been marked as a duplicate of this bug. ***
Comment 25 Martin Peres 2019-04-08 11:34:13 UTC
*** Bug 105681 has been marked as a duplicate of this bug. ***
Comment 26 CI Bug Log 2019-04-09 06:50:55 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium \(hdmi-cmp-nv12\) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none) +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_249/fi-icl-u2/igt@runner@aborted.html
Comment 27 CI Bug Log 2019-04-09 06:51:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none) +}

 No new failures caught with the new filter
Comment 28 Jani Saarinen 2019-04-11 10:48:51 UTC
Fix WIP. Ville, any ETA here?
Comment 29 Jani Saarinen 2019-04-15 08:04:23 UTC
Still WIP.
Comment 30 Jani Saarinen 2019-04-23 11:05:15 UTC
Still WIP, talking to architects.
Comment 31 Jani Saarinen 2019-04-25 06:51:28 UTC
Ville, we need patches sent for review asap.
Comment 32 CI Bug Log 2019-04-29 11:41:29 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_267/fi-icl-u2/igt@runner@aborted.html
Comment 33 CI Bug Log 2019-04-30 06:58:18 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_268/fi-icl-u2/igt@runner@aborted.html
Comment 34 Jani Saarinen 2019-05-02 15:04:07 UTC
Still WIP. Ville, please update this bug when there is news.
Comment 35 Jani Saarinen 2019-05-06 05:25:15 UTC
Reference: https://patchwork.freedesktop.org/series/60271/
Comment 36 Jani Saarinen 2019-05-10 13:20:56 UTC
Still under review.
Comment 37 CI Bug Log 2019-05-16 08:21:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_286/fi-icl-dsi/igt@runner@aborted.html
Comment 38 CI Bug Log 2019-05-16 08:33:36 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) -}
{+ ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_286/fi-icl-dsi/igt@runner@aborted.html
Comment 39 CI Bug Log 2019-05-16 08:34:29 UTC
The CI Bug Log issue associated to this bug has been updated.

### Removed filters

* ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) (added on a minute ago)

### New filters associated

* ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x)
  (No new failures associated)
Comment 40 CI Bug Log 2019-05-16 08:34:40 UTC
The CI Bug Log issue associated to this bug has been updated.

### Removed filters

* ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x) (added on 13 minutes ago)

### New filters associated

* ICL: igt@runnerr@aborted - fail - Previous test: kms_cursor_crc (cursor-64x64-suspend)|kms_chamelium (hdmi-cmp-nv12)|kms_plane_lowres (pipe-b-tiling-none|x)
  (No new failures associated)
Comment 41 Jani Saarinen 2019-06-05 11:46:06 UTC
Changes from Ville were merged in CI_DRM_6150:
c457d9cf256e drm/i915: Make sure we have enough memory bandwidth on ICL
d284d5145eb8 drm/i915: Make sandybridge_pcode_read() deal with the second data register

There might still be some underruns, but this bug was about the major issues seen, and those should now be fixed. If new issues appear, let's file a new bug rather than re-opening this one.
