This bug is split from Bug 107713.
This bug is created to address following test failures (incomplete) on ICL platform
The CI Bug Log issue associated to this bug has been updated.
### New filters associated
* ICL: igt@kms_rotation_crc@* - incomplete (No new failures associated)
This bug depends on 106885 and based on the hit rate this bug is set as highest priority.
I've been trying to reproduce this bug by running same set of tests as what CI does run to hit this but I never see any hang on ICL box. I'll keep on trying still.
I was reading through dmesgs on these failed tests and they don't make me much smarter as dmesg is cut before the moment when reboot happen.
One notable issue I did find though. For example on one of my logs from CI immediately before incomplete test primary-rotation-90 is run primary-rotation-270. This primary-rotation-270 test has taken 3.02 minutes to finish with 2 4k displays connected to ICL in CI. When I run primary-rotation-270 test on ICL box outside CI with two 4k displays it takes 1.53 minutes to finish. Can't say if this has special meaning but it's difference seen just before reboot.
*** Bug 110026 has been marked as a duplicate of this bug. ***
(In reply to Juha-Pekka Heikkilä from comment #4)
> One notable issue I did find though. For example on one of my logs from CI
> immediately before incomplete test primary-rotation-90 is run
> primary-rotation-270. This primary-rotation-270 test has taken 3.02 minutes
> to finish with 2 4k displays connected to ICL in CI. When I run
> primary-rotation-270 test on ICL box outside CI with two 4k displays it
> takes 1.53 minutes to finish. Can't say if this has special meaning but it's
> difference seen just before reboot.
1 minute per screen seems to fit this. Maybe a bit excessive amount of testing, not sure what it's doing for 1 minute ...
you also have pstore logs that give some info on the incompletes:
You can see "pstore2-1554047628_Panic_1" near the top, which gets you to: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5847/shard-iclb3/pstore2-1554047628_Panic_1.log
that was recovered from the pstore after the crash.
It may prove useful, especially since we see similar crashes in other bugs (https://bugs.freedesktop.org/show_bug.cgi?id=108840, https://bugs.freedesktop.org/show_bug.cgi?id=109507)
Comments from Chris:
This is not a Gem issue. These pstores are just the userspace controlled watchdog firing too early.
I'm still running test sets which earlier crashed u2 box. U3 now look to behave in all expected ways. Test execution times remain stable through different runs and I don't see those one second delays in dmesg close to addfb. Maybe next wait to see what does this box say for full run of IGT tests.
I got it reproduced on my ICL as well.
It hanged both times, timing out intel_dp_aux_xfer. The log is full of those
dp aux channel related error messages. Then it just hangs..
I would try updating BIOS, to see if there is any difference.
7746.298340] [drm:intel_dp_aux_xfer] *ERROR* dp_aux_ch not done status 0xad4003ff
[ 7746.308601] ------------[ cut here ]------------
[ 7746.326490] [drm:__drm_atomic_state_free] Freeing atomic state 00000000fd922f6f
[ 7746.326584] [drm:drm_atomic_state_default_clear] Clearing atomic state 000000000d7cc73f
[ 7746.326586] [drm:__drm_atomic_state_free] Freeing atomic state 000000000d7cc73f
[ 7746.350457] dp_aux_ch not started status 0x7d4003ff
[ 7746.350465] WARNING: CPU: 2 PID: 21 at drivers/gpu/drm/i915/intel_dp.c:1246 intel_dp_aux_xfer+0x39f/0x720
[ 7746.368313] [drm:drm_atomic_state_init] Allocated atomic state 000000000d7cc73f
[ 7746.393725] Modules linked in: x86_pkg_temp_thermal efivarfs
[ 7746.393729] CPU: 2 PID: 21 Comm: kworker/2:0 Tainted: G UD 5.1.0-rc1+ #2
[ 7746.840596] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3087.A00.1902250334 02/25/2019
[ 7746.881485] Workqueue: events i915_hotplug_work_func
.crtc_clock (exp[ 7746.896355] RIP: 0010:intel_dp_aux_xfer+0x39f/0x720
[ 7746.915114] Code: 00 40 0f 85 2c 02 00 00 83 6c 24 28 01 0f 85 d1 fe ff ff 8b 74 24 3c e9 7d fe ff ff 89 c6 48 c7 c7 68 01 69 af e8 a1 e6 ab ff <0f> 0b 89 2d e9 a7 f1 00 e9 6b fd ff ff 4c 89 ff 49 8b f
[ 7746.971370] RSP: 0018:ffffa76601a0fae8 EFLAGS: 00010286
[ 7746.987013] RAX: 0000000000000000 RBX: 00000000fffffff0 RCX: 0000000000000006
[ 7747.008389] RDX: 0000000000000007 RSI: 0000000000000002 RDI: ffff93839ee95410
[ 7747.029758] RBP: 000000007d4003ff R08: 0000000000000000 R09: 00000000000202c0
[ 7747.051130] R10: 00000000ad4003ff R11: 0000000000bc38e0 R12: 0000000000000004
[ 7747.072498] R13: ffff93839cfd51d0 R14: ffff9383998d8000 R15: ffff93839cfd5110
[ 7747.093866] FS: 0000000000000000(0000) GS:ffff93839ee80000(0000) knlGS:0000000000000000
[ 7747.118098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7747.135302] CR2: 00007f67af0b91c8 CR3: 000000035e60e002 CR4: 0000000000760ee0
[ 7747.156663] PKRU: 55555554
[ 7747.164756] Call Trace:
[ 7747.172074] ? schedule_hrtimeout_range_clock+0xc2/0x190
ected 533250, fo[ 7747.187980] intel_dp_aux_transfer+0xa7/0x200
[ 7747.205172] ? usleep_range+0x57/0x80
[ 7747.216128] drm_dp_dpcd_access+0x74/0x110
[ [ 7747.228385] drm_dp_dpcd_read+0x29/0xc0
[ 7747.244017] intel_dp_read_dpcd+0x38/0x120
[ 7747.256276] intel_dp_get_dpcd+0x1d/0xe0
+0.000003] [drm:[ 7747.268013] intel_dp_detect+0x349/0x5a0
[ 7747.283909] ? _cond_resched+0x10/0x40
pipe_config_err][ 7747.295129] drm_helper_probe_detect_ctx+0x67/0xd0
[ 7747.313629] intel_encoder_hotplug+0x2b/0x90
mismatch in por[ 7747.326412] intel_ddi_hotplug+0x29/0x2a0
[ 7747.342568] ? drm_dbg+0x7f/0x90
t_clock (expecte[ 7747.352228] i915_hotplug_work_func+0x18f/0x1f0
[ 7747.369950] process_one_work+0x1e3/0x3e0
d 540000, found [ 7747.381955] worker_thread+0x28/0x3c0
[ 7747.397075] ? set_worker_desc+0xb0/0xb0
[ 7747.408814] kthread+0x10e/0x130
[ 7747.418471] ? kthread_park+0x80/0x80
[ +0.0[ 7747.429432] ret_from_fork+0x1f/0x40
[ 7747.444289] WARNING: CPU: 2 PID: 21 at drivers/gpu/drm/i915/intel_dp.c:1246 intel_dp_aux_xfer+0x39f/0x720
00003] [drm:pipe[ 7747.472955] ---[ end trace 5fa4f9eb29cf4f45 ]---
_config_err] mismatch in infoframes.enable (expected 0x00000000, found 0x00000039)
[ +0.000003] [drm:pipe_conf[ 7747.517600] [drm:intel_dp_aux_xfer] *ERROR* dp aux hw did not signal timeout!
ig_infoframe_err] mismatch in av[ 7747.546325] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[ +0.000003] [drm:pipe_config_infoframe_err] expected:
[ +2.336211] [drm:intel_b[ 7747.593600] [drm:intel_dp_aux_xfer] *ERROR* dp aux hw did not signal timeout!
w_atomic_check] pipe A num active planes 0
[ +[ 7747.624298] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
0.000003] [drm:intel_bw_atomic_check] pipe B num active planes 0
[ +0.000004] [drm:intel_bw_atomic_check] pipe C num active pl[ 7747.679730] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[ +0.033070] [drm:__drm_atomic_state_free] Freeing atomic state 000000005b3dfba2
[ +0.000213] [drm:drm_mode_addfb2] [[ 7747.731644] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[ +0.031768] [drm:drm_atomic_state_init] Allocated atomic state 00000000c27efb7d
[ +0.000003] [drm:drm_atomic_get_plane_state] Added[ 7747.790660] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[PLANE:84:plane 1B] 0000000082c41df2 state to 0[ 7747.823600] [drm:intel_dp_aux_xfer] *ERROR* dp aux hw did not signal timeout!
[ +0.000003] [drm:drm_atomic_g[ 7747.857599] [drm:intel_dp_aux_xfer] *ERROR* dp aux hw did not signal timeout!
et_crtc_state] Added [CRTC:136:pipe B] 00000000cf007469 state to[ 7747.892084] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[ +0.000002] [drm:drm_atomic_set_crtc_for_plane] Link [PLANE:84:plane 1B] st[ 7747.937205] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
ate 0000000082c41df2 to [CRTC:136:pipe B]
[ +0.000002] [drm:drm_atomic_set_fb_for_plane] Set [FB:239] for [PLANE:84:plane 1B] [ 7747.989938] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
[ +0.000002] [drm:drm_atomic_check_only] checking 00000000c27efb7d
[ +0.000007] [drm:[ 7748.041314] [drm:intel_dp_aux_xfer] dp_aux_ch timeout status 0x7d4003ff
intel_plane_atomic_calc_changes] [CRTC:136:pipe B] has [PLANE:84:plane 1B] with fb 239
[ +0.000001] [drm:intel_plane_atomic_calc_changes] [PLANE:84:plane 1B] visible 0 -> 1, [ 7748.105599] [drm:intel_dp_aux_xfer] *ERROR* dp aux hw did not signal timeout!
[ 7748.130159] [drm:intel_dp_aux_xfer] *ERROR* dp_aux_ch not done status 0xad4003ff
There is a known regression that came with a BIOS version that leads to multiple IRQ storms from i2c controlers. After updating fi-icl-u3's BIOS on Friday, it went from being the highest-hit machine to not having a single failure in 2 drmtip runs. fi-icl-u2 has been updated and will show complete results in drmtip_252.
As for fi-icl-y, no BIOS update has been delivered to us, but we verified that we indeed receive a significant amount of IRQs per second. We will implement the recommended workaround for the IRQ storm: disable i2c 4 & 5 in the BIOS settings (Cc: Tomi Sarvela).
The incomplete might mean two things:
- the test could not exit fast-enough after exceeding the timeout
- the machine died when executing the test
Based on the evidence we have, it looks like the problem is that the tests do not have time to run nor exit in time because of the IRQ storm.
The user impact is critical (extreme slowness), but the workaround is easy to apply in case updating the BIOS is not possible. Dropping the priority to HIGH while we keep on monitoring the reproduction rate of the issue on all the platforms.
Issue seen again on ICL-u2 / re even with BIOS updated.
(In reply to Jani Saarinen from comment #12)
> Issue seen again on ICL-u2 / re even with BIOS updated.
Yet to see the issue on 3175 BIOS on fi-icl-u2 and re-icl-u. Also, yet to see the failure on icl-y with newer BIOS 3161.
It start to look like this bug is not a bug in kernel nor the test but features in combination with IGT and CI which cause this "bug". On IGT there has been added more supported framebuffer formats which this test will try all of, this with added multiple monitors mostly being 4k will increase the load causing this test to execute quite a while. ICL on my desk with icl-u where I have two 4k displays (edp&hdmi) this test will run almost steady time 516 seconds which increases the likeliness of this test to hit 'test set overall ~960 seconds timeout in CI'
This test should be rewritten for many different reasons, one of them being excessive use of time. Also this test seems to fail 50% time in creating YUV planes but still in the end says 'success' so I think this test doesn't actually test anything.
Moving to CI infra. Lowering priority.