Bug 107724 - [CI] [BAT] [DRMTIP] igt@* - dmesg-warn / dmesg-fail - *ERROR* CPU pipe [ABC] FIFO underrun
Summary: [CI] [BAT] [DRMTIP] igt@* - dmesg-warn / dmesg-fail - *ERROR* CPU pipe [ABC] ...
Status: ASSIGNED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest critical
Assignee: Ville Syrjala
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 107720
Depends on:
Blocks: 105980
Reported: 2018-08-28 15:23 UTC by Martin Peres
Modified: 2018-12-07 22:28 UTC

See Also:
i915 platform: ICL
i915 features: display/Other


Comment 1 James Ausmus 2018-09-24 23:03:21 UTC
*** Bug 107720 has been marked as a duplicate of this bug. ***
Comment 2 Paulo Zanoni 2018-09-28 17:16:22 UTC
Update: working on it. I have already identified a few problems; this is not something we're going to solve with a single patch. I'll provide more updates once I have real patches.
Comment 3 Paulo Zanoni 2018-10-04 23:33:06 UTC
Submitted https://patchwork.freedesktop.org/series/50579/ but I'm not sure it will solve the problem.

I was looking at the logs for fi-icl-u and it seems that sometimes during boot the interrupts just act like crazy: either you get a ton of interrupts, or the IIR registers are unclearable and contain crazy values, with even reserved bits set. I simply can't reproduce the type of problem you're seeing. Example:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-icl-u/boot3.log

This could be some memory corruption happening, or it could simply be a bad BIOS.

Another thing that was brought to my attention is that we often get the mysterious crazy interrupts right after enabling DMC. Would it be possible to run a few tests with DMC disabled? The CI pages suggest the problem happens only around 11% of the time, so a few rounds of tests would probably be needed :/
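
As a rough guide to how many rounds "a few" would have to be, here is a back-of-the-envelope sketch. It assumes the ~11% figure is a per-run failure probability and that runs are independent of each other; neither assumption is established in this report, and the function name is purely illustrative.

```python
import math


def runs_needed(p_fail: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - p_fail)**n >= confidence.

    Assumption (not from the bug report): each run fails independently
    with probability p_fail.
    """
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    # Probability of n consecutive clean runs is (1 - p_fail)**n;
    # solve (1 - p_fail)**n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    for conf in (0.95, 0.99):
        print(conf, runs_needed(0.11, conf))
```

Under those assumptions, about 26 consecutive clean runs with DMC disabled would be needed before a clean streak has less than a 5% chance of being luck, and about 40 for less than 1%.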

Perhaps giving me remote access to the machine would also help us move forward a little faster.

Thanks,
Paulo
Comment 4 steven.j.hockemeier 2018-10-09 20:22:52 UTC
I also noticed that this is being classified as Highest, but I thought that classification was reserved for showstoppers. Does an 11% failure rate still fall within that severity? (just checking)
Comment 5 Lakshmi 2018-10-10 06:52:19 UTC
This issue is occurring in every round of drm-tip execution.
Comment 6 Paulo Zanoni 2018-10-12 17:09:45 UTC
(In reply to Lakshmi from comment #5)
> This issue is occurring in every round of drm-tip execution.

When it happens, boot.log always shows craziness during machine initialization:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-icl-u/boot8.log

None of the machines we have here can reproduce this problem. I wonder if it's a hardware/bios issue with the specific ICL machine that's in CI.
Comment 7 Ville Syrjala 2018-10-12 20:33:30 UTC
The pipe C IIR noise is a bit odd. Bspec seems to be telling me that these registers live in PG2, whereas the code appears to assume that the register lives in whatever power well the pipe lives in. So either the code is a bit wrong (though it should probably still work just fine) or the spec is wrong. Since all the power wells up to PG4 should be enabled, it shouldn't really matter here either way. I agree with Paulo that DMC might have something to do with this as well.

The WARN_ON(!intel_pstate->base.fb) is also mysterious. We should have either reused the BIOS fb or disabled all the planes. So I can't really see how an enabled plane could get that far without a framebuffer.
Comment 8 Lakshmi 2018-10-15 16:30:03 UTC
(In reply to Paulo Zanoni from comment #6)
> (In reply to Lakshmi from comment #5)
> > This issue is occurring in every round of drm-tip execution.
> 
> When it happens, boot.log always shows craziness during machine
> initialization:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-icl-u/boot8.log
> 
> None of the machines we have here can reproduce this problem. I wonder if
> it's a hardware/bios issue with the specific ICL machine that's in CI.

Paulo, this issue was last seen here:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_127/fi-icl-u/igt@kms_busy@extended-modeset-hang-oldfb-render-a.html

Looks like this is still happening. Do you think this occurred for some other reason?
Comment 9 James Ausmus 2018-10-18 18:28:00 UTC
After discussion with Paulo and JaniS, it appears this problem is specific to this one ICL board, as it's not reproducing on the other ICL in CI, and we can't reproduce on any of our local hardware. It was agreed to swap a different board in for CI. I'm lowering this to Medium, and if the issue doesn't reproduce with the new CI HW, we should close this.
Comment 10 Lakshmi 2018-10-30 14:47:36 UTC
(In reply to James Ausmus from comment #9)
> After discussion with Paulo and JaniS, it appears this problem is specific
> to this one ICL board, as it's not reproducing on the other ICL in CI, and
> we can't reproduce on any of our local hardware. It was agreed to swap a
> different board in for CI. I'm lowering this to Medium, and if the issue
> doesn't reproduce with the new CI HW, we should close this.

This issue appears on fi-icl-u2 as well.
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4700/fi-icl-u2/igt@gem_ctx_create@basic-files.html
Raising the priority as it happens in BAT.
Comment 11 James Ausmus 2018-10-30 15:00:08 UTC
Hmm - this looks suspicious. If you take a look at https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u2.html you'll see that the only two times igt@gem_ctx_create@basic-files *failed* are the two times that igt@debugfs_test@read_all_entries and igt@gem_exec_suspend@basic-s3 *didn't* fail with power-well-related errors.

There could be a connection with the power well issues.
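
For anyone who wants to check this kind of pattern mechanically rather than by eye, a minimal sketch follows. The run names and results in it are made-up placeholders, not data from the CI page, and the helper name and the simplified "pass"/"fail" statuses are assumptions for illustration.

```python
# Made-up placeholder results for illustration only; NOT taken from the CI page.
EXAMPLE_RUNS = {
    "run_a": {"basic-files": "fail", "read_all_entries": "pass", "basic-s3": "pass"},
    "run_b": {"basic-files": "pass", "read_all_entries": "fail", "basic-s3": "fail"},
    "run_c": {"basic-files": "fail", "read_all_entries": "pass", "basic-s3": "pass"},
    "run_d": {"basic-files": "pass", "read_all_entries": "fail", "basic-s3": "fail"},
}


def failures_coincide(runs, target, others):
    """Return True if `target` failed in exactly the runs where none of `others` failed."""
    target_failed = {name for name, res in runs.items() if res[target] == "fail"}
    others_clean = {
        name for name, res in runs.items()
        if all(res[test] != "fail" for test in others)
    }
    return target_failed == others_clean


if __name__ == "__main__":
    print(failures_coincide(EXAMPLE_RUNS, "basic-files",
                            ["read_all_entries", "basic-s3"]))
```

A True result only says the two sets of runs match; with just two failing runs it is a hint worth following up, not evidence of a causal link.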
Comment 13 James Ausmus 2018-11-02 16:01:33 UTC
This hasn't occurred again in BAT since the two tests with the suspicious result pattern. Moving this back to High.
Comment 14 James Ausmus 2018-11-14 21:23:26 UTC
Waiting to move forward with this, on the idea that Ville's watermark series at https://patchwork.freedesktop.org/series/51878/ might help here.
Comment 15 Jani Saarinen 2018-11-15 15:46:21 UTC
Ville, you have some WM patches in review; should they help here?
On the latest runs this is now getting worse, also on fi-icl-u2.
Comment 16 Radosław Szwichtenberg 2018-11-16 13:13:02 UTC
Affecting 289 tests on CI.

