[ 214.128456] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe C FIFO underrun
[ 224.735148] asynchronous wait on fence i915:kms_busy/0:1 timed out
*** Bug 107720 has been marked as a duplicate of this bug. ***
Update: working on it. I have already identified a few problems; this is not something we're going to solve with a single patch. I'll provide more updates once I have real patches.
Submitted https://patchwork.freedesktop.org/series/50579/ but I'm not sure it will solve the problem.
I was looking at the logs for fi-icl-u and it seems that sometimes during boot the interrupts just act like crazy: either you get a ton of interrupts, or the IIR registers are unclearable and contain crazy values, with even reserved bits set. I simply can't reproduce the type of problem you're seeing. Example:
This could be some memory corruption happening, or it could simply be a bad BIOS.
Another thing that was brought to my attention is that we often get the mysterious crazy interrupts right after enabling DMC. Would it be possible to run a few tests with DMC disabled? The CI pages suggest the problem happens only around 11% of the time, so a few rounds of tests would probably be needed :/
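For a rough sense of how many rounds "a few" means at an 11% failure rate, here is a minimal sketch. It assumes each run fails independently with the same probability, which may not hold if the failure depends on board state or boot history. The function name `clean_runs_needed` is made up for illustration. The idea: if the bug were still present, the chance of n clean runs in a row would be (1 - 0.11)^n, so we look for the smallest n that pushes that chance below a chosen threshold.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that (1 - failure_rate)**n <= 1 - confidence.

    Assumes every run fails independently with probability failure_rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # 0.11 is the failure rate quoted from the CI pages in this comment.
    for confidence in (0.90, 0.95, 0.99):
        runs = clean_runs_needed(0.11, confidence)
        print(f"{confidence:.0%} confidence: {runs} consecutive clean runs")
```

Under those assumptions, roughly 26 consecutive clean runs with DMC disabled would be needed before a missing failure is more than 95% likely to mean something.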
Perhaps giving me remote access to the machine would also help us move forward a little faster.
I also noticed that this is being classified as Highest, but I thought that classification was reserved for showstoppers. Does an 11% failure rate still fall within that severity? (just checking)
This issue is occurring in every round of drm-tip execution.
(In reply to Lakshmi from comment #5)
> This issue is occurring in every round of drm-tip execution.
When it happens, boot.log always shows craziness during machine initialization:
None of the machines we have here can reproduce this problem. I wonder if it's a hardware/bios issue with the specific ICL machine that's in CI.
The pipe C IIR noise is a bit odd. Bspec seems to be telling me that these registers live in PG2, whereas the code appears to assume that each register lives in whatever power well its pipe lives in. So either the code is a bit wrong (though it should probably still work just fine) or the spec is wrong. Since all the power wells up to PG4 should be enabled, it shouldn't really matter here either way. I agree with Paulo that DMC might have something to do with this as well.
The WARN_ON(!intel_pstate->base.fb) is also mysterious. We should have either reused the BIOS fb or disabled all the planes. So I can't really see how an enabled plane could get that far without a framebuffer.
(In reply to Paulo Zanoni from comment #6)
> (In reply to Lakshmi from comment #5)
> > This issue is occurring in every round of drm-tip execution.
> When it happens, boot.log always shows craziness during machine
> None of the machines we have here can reproduce this problem. I wonder if
> it's a hardware/bios issue with the specific ICL machine that's in CI.
Paulo, last seen this issue
Looks like this is still happening. Do you think this occurred for some other reason?
After discussion with Paulo and JaniS, it appears this problem is specific to this one ICL board, as it's not reproducing on the other ICL in CI, and we can't reproduce on any of our local hardware. It was agreed to swap a different board in for CI. I'm lowering this to Medium, and if the issue doesn't reproduce with the new CI HW, we should close this.
(In reply to James Ausmus from comment #9)
> After discussion with Paulo and JaniS, it appears this problem is specific
> to this one ICL board, as it's not reproducing on the other ICL in CI, and
> we can't reproduce on any of our local hardware. It was agreed to swap a
> different board in for CI. I'm lowering this to Medium, and if the issue
> doesn't reproduce with the new CI HW, we should close this.
This issue appears on fi-icl-u2 as well.
Raising the priority as it happens with BAT.
Hmm - this looks suspicious. If you take a look at https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u2.html you'll see that the only two times igt@gem_ctx_create@basic-files *failed*, are the two times that igt@debugfs_test@read_all_entries and igt@gem_exec_suspend@basic-s3 *didn't* fail with powerwell related errors.
There could be an interrelation with the powerwell issues.
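A pattern like this can be checked mechanically rather than by eye. The sketch below is only an illustration: the run records are invented placeholder data, not the real fi-icl-u2 history, and the function name `failure_pattern` is hypothetical. It collects the runs where one test failed and the runs where none of the other tests failed, so the two sets can be compared.

```python
from typing import Dict, List, Set, Tuple

Run = Dict[str, str]  # test name -> "pass" or "fail"


def failure_pattern(runs: List[Run], target: str,
                    others: List[str]) -> Tuple[Set[int], Set[int]]:
    """Return (runs where target failed, runs where none of others failed)."""
    target_failed = {i for i, run in enumerate(runs)
                     if run.get(target) == "fail"}
    others_clean = {i for i, run in enumerate(runs)
                    if all(run.get(name) != "fail" for name in others)}
    return target_failed, others_clean


TARGET = "igt@gem_ctx_create@basic-files"
OTHERS = ["igt@debugfs_test@read_all_entries", "igt@gem_exec_suspend@basic-s3"]

# Placeholder results for five runs; NOT real CI data.
SAMPLE_RUNS: List[Run] = [
    {TARGET: "pass", OTHERS[0]: "fail", OTHERS[1]: "fail"},
    {TARGET: "fail", OTHERS[0]: "pass", OTHERS[1]: "pass"},
    {TARGET: "pass", OTHERS[0]: "fail", OTHERS[1]: "fail"},
    {TARGET: "fail", OTHERS[0]: "pass", OTHERS[1]: "pass"},
    {TARGET: "pass", OTHERS[0]: "fail", OTHERS[1]: "fail"},
]

if __name__ == "__main__":
    failed, clean = failure_pattern(SAMPLE_RUNS, TARGET, OTHERS)
    print("target failed in runs:", sorted(failed))
    print("other tests clean in runs:", sorted(clean))
    print("sets identical:", failed == clean)
```

If the two sets match on the real history, that would support the idea that the tests are tied to the same underlying powerwell problem. With only two occurrences, though, it could still be coincidence.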
This hasn't occurred again in BAT since the two tests with the suspicious result pattern. Moving this back to High.
Holding off on this for now, on the idea that Ville's watermark series at https://patchwork.freedesktop.org/series/51878/ might help out here.
Ville, you have some WM patches in review; should they help here?
On the latest runs this is now getting worse, also on ICL-u2.
Affecting 289 tests on CI.