Summary: | [CI][BAT] igt@gem_* - incomplete - *ERROR* render/media: timed out waiting for forcewake ack request. | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> | ||||||||||
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> | ||||||||||
Status: | RESOLVED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||||||
Severity: | normal | ||||||||||||
Priority: | high | CC: | 0.freedesktop, intel-gfx-bugs, jarred.kenny | ||||||||||
Version: | XOrg git | ||||||||||||
Hardware: | Other | ||||||||||||
OS: | All | ||||||||||||
Whiteboard: | ReadyForDev | ||||||||||||
i915 platform: | BSW/CHT | i915 features: | GEM/Other | ||||||||||
Attachments: |
|
Description
Martin Peres
2018-11-12 10:56:16 UTC
That's a first for bsw. I wouldn't be surprised if it coincided with a GPU hang. We will never know! Last happened 12 hours ago. What's needed to investigate further? Trying to reproduce locally. Created attachment 142633 [details]
kernel log
I have been attempting to track down this issue for some time now. We run into this issue on our devices every ~100 of so boots. When this occurs, the system is locked up entirely and does not respond to any keyboard input. The logs are also missing from the system's journal on the following boot. I was able to capture the attached log over a serial console, with drm.debug=0xfe in my kernel line. We also have hundreds of these devices and I've been reproducing this issue by rebooting 12 of them every minute via cron. With 12 of them rebooting every minute, we generally see 4-5 failures a day. I would be happy to apply and test kernel patches on our test bed of devices if that is useful. (In reply to Jarred Kenny from comment #4) > Created attachment 142633 [details] > kernel log No GPU hang. Earliest errors are [ 8.446640] [drm:vlv_set_power_well] *ERROR* timeout setting power well state 00000000 (fcfff3ff) [ 8.460049] [drm:chv_dpio_cmn_power_well_enable] *ERROR* Display PHY 0 is not power up [ 8.472407] ------------[ cut here ]------------ [ 8.484644] DPIO read pipe A reg 0x8170 == 0xffffffff [ 8.496940] WARNING: CPU: 2 PID: 69 at /home/tj/build/linux-4.18.6/drivers/gpu/drm/i915/intel_sideband.c:193 vlv_dpio_read+0x75/0x90 [ 8.509414] Modules linked in: [ 8.521760] CPU: 2 PID: 69 Comm: kworker/2:1 Tainted: G W 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1 [ 8.534306] Hardware name: Intel Corporation STK1A32SC/STK1A32SC, BIOS SCCHTAX5.86A.0041.2018.0620.1803 06/20/2018 [ 8.546914] Workqueue: events output_poll_execute [ 8.559417] RIP: 0010:vlv_dpio_read+0x75/0x90 [ 8.571805] Code: 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 75 22 48 83 c4 10 5b 5d c3 8d 73 41 89 c1 89 ea 48 c7 c7 a0 d7 9d ac e8 6b 8d a0 ff <0f> 0b 8b 44 24 04 eb ce e8 fe 8f a0 ff 0f 1f 40 00 66 2e 0f 1f 84 [ 8.585002] RSP: 0000:ffffa8228063bb88 EFLAGS: 00010286 [ 8.597944] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffacc59d08 [ 8.610873] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246 [ 8.623696] RBP: 0000000000008170 R08: 0000000000000000 R09: 0000000000000639 [ 8.636388] R10: 0000000000001af0 R11: 0000000000000000 R12: 0000000000000000 [ 8.648938] R13: ffffffffacce3940 R14: ffff9cc137454918 R15: 0000000000000000 [ 8.661401] FS: 0000000000000000(0000) GS:ffff9cc17ad00000(0000) knlGS:0000000000000000 [ 8.673860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.686224] CR2: ffffa8228045c000 CR3: 000000007620a000 CR4: 00000000001006e0 [ 8.698578] Call Trace: [ 8.710792] ? _cond_resched+0x16/0x40 [ 8.722928] chv_dpio_cmn_power_well_enable+0xa4/0x1f0 [ 8.734994] intel_power_well_enable+0x34/0x40 [ 8.746964] __intel_display_power_get_domain+0x7c/0x90 [ 8.758881] intel_display_power_get+0x2e/0x40 [ 8.770701] intel_dp_detect+0x11c/0x5d0 [ 8.782449] drm_helper_probe_single_connector_modes+0xd9/0x6f0 [ 8.794149] ? drm_setup_crtcs+0x3d/0xa80 [ 8.805708] ? drm_dbg+0x84/0x90 [ 8.817121] drm_setup_crtcs+0x13c/0xa80 [ 8.828489] ? drm_dbg+0x84/0x90 [ 8.839698] ? drm_fb_helper_is_bound.isra.16+0x73/0xa0 [ 8.850851] drm_fb_helper_hotplug_event.part.30+0x74/0xb0 [ 8.861950] output_poll_execute+0xa9/0x1a0 [ 8.872910] process_one_work+0x191/0x370 [ 8.883731] worker_thread+0x204/0x3b0 [ 8.894428] kthread+0xf8/0x130 [ 8.905118] ? rescuer_thread+0x340/0x340 [ 8.915716] ? kthread_create_worker_on_cpu+0x70/0x70 [ 8.926283] ret_from_fork+0x35/0x40 [ 8.936749] ---[ end trace 7f736234f2558a0d ]--- [ 8.947140] ------------[ cut here ]------------ [ 8.948334] systemd[1]: Listening on udev Kernel Socket. [ 8.957469] DPIO read pipe A reg 0x8098 == 0xffffffff [ 8.974893] WARNING: CPU: 2 PID: 69 at /home/tj/build/linux-4.18.6/drivers/gpu/drm/i915/intel_sideband.c:193 vlv_dpio_read+0x75/0x90 [[0;32m OK [0m] Listening on udev Kernel Socket. [ 8.985356] Modules linked in: [ 8.995827] CPU: 2 PID: 69 Comm: kworker/2:1 Tainted: G W 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1 [ 9.006485] Hardware name: Intel Corporation STK1A32SC/STK1A32SC, BIOS SCCHTAX5.86A.0041.2018.0620.1803 06/20/2018 [ 9.017166] Workqueue: events output_poll_execute [ 9.027724] RIP: 0010:vlv_dpio_read+0x75/0x90 [ 9.038146] Code: 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 75 22 48 83 c4 10 5b 5d c3 8d 73 41 89 c1 89 ea 48 c7 c7 a0 d7 9d ac e8 6b 8d a0 ff <0f> 0b 8b 44 24 04 eb ce e8 fe 8f a0 ff 0f 1f 40 00 66 2e 0f 1f 84 [ 9.049277] RSP: 0000:ffffa8228063bb88 EFLAGS: 00010286 [ 9.060121] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 9.071048] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246 [ 9.081807] RBP: 0000000000008098 R08: 0000000000000000 R09: 0000000000000002 [ 9.092404] R10: 0000000000001af0 R11: 0000000000000000 R12: 0000000000000000 [ 9.102856] R13: ffffffffacce3940 R14: ffff9cc137454918 R15: 0000000000000000 [ 9.113185] FS: 0000000000000000(0000) GS:ffff9cc17ad00000(0000) knlGS:0000000000000000 [ 9.123494] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.133682] CR2: ffffa8228045c000 CR3: 000000007620a000 CR4: 00000000001006e0 [ 9.143822] Call Trace: [ 9.153861] chv_dpio_cmn_power_well_enable+0x1cb/0x1f0 [ 9.163993] intel_power_well_enable+0x34/0x40 [ 9.174077] __intel_display_power_get_domain+0x7c/0x90 [ 9.184140] intel_display_power_get+0x2e/0x40 [ 9.194177] intel_dp_detect+0x11c/0x5d0 [ 9.204091] drm_helper_probe_single_connector_modes+0xd9/0x6f0 [ 9.214018] ? drm_setup_crtcs+0x3d/0xa80 [ 9.223821] ? drm_dbg+0x84/0x90 [ 9.233570] drm_setup_crtcs+0x13c/0xa80 [ 9.243134] ? drm_dbg+0x84/0x90 [ 9.252478] ? drm_fb_helper_is_bound.isra.16+0x73/0xa0 [ 9.261757] drm_fb_helper_hotplug_event.part.30+0x74/0xb0 [ 9.270909] output_poll_execute+0xa9/0x1a0 [ 9.279901] process_one_work+0x191/0x370 [ 9.288873] worker_thread+0x204/0x3b0 [ 9.297512] kthread+0xf8/0x130 [ 9.305772] ? rescuer_thread+0x340/0x340 [ 9.313997] ? kthread_create_worker_on_cpu+0x70/0x70 [ 9.322277] ret_from_fork+0x35/0x40 [ 9.330554] ---[ end trace 7f736234f2558a0e ]--- Being powerwell related, very suspicious. First thing is to repeat with https://cgit.freedesktop.org/drm/drm-tip Fwiw, I was able to reliably hit media forcewake errors by forcing the LRI path for context pdp updates (and running gem_concurrent_blit). Could just be a coincidence. Created attachment 142642 [details]
dmesg-drmtip
I ran the same 1-minute reboot test on one of my devices running 4.20-rc4/drm-tip overnight and was able to capture the same failure via serial logging. I've attached the kernel log capturing the failure above. The punit seems to die, which is not conducive to things actually working. Imre noticed that it's happening suspiciously just after: [ 5.287531] i2c_designware 808622C1:06: timeout in disabling adapter We need to check whether the i2c bus is supposedly shared with the punit. I don't see the relevant info msg in the dmesg, so first make sure you have CONFIG_I2C_DESIGNWARE_BAYTRAIL=y in your kernel config, and then attach the output of acpidump. It looks like you've picked up on the same thread I've been following. Here's some comments I left for Jarred earlier (in IRC) which I don't know that he has seen yet. 08:42 <TJ-> I've been working on the dmesg again and I'm looking to a wider cause. Communication with the GPU sub-systems is partially done using IIC (i2c). The root cause seems to start with "[ 5.102322] i2c_designware 808622C1:06: controller timed out". 808622C1 is the designware CHERRY_TRAIL and is supposed to be configured by ACPI DSDT values 09:03 <TJ-> Might be getting closer... there was a patch to fix a very similar problem in fba4adbbf670 (which I suspect might help here based on the commit comment) but that got reverted/changed in 06cb616b1bca and this later change makes a global change but is instigated for a particular platform. I wonder if the first patch might be more correct on your devices. 09:24 <TJ-> OK, we need to capture what little debug messages the i2c-designware driver writes via dev_dbg() so add "debug" to the kernel command-line. This command will show you the messages we can expect to see: "git grep -A1 dev_dbg drivers/i2c/busses/i2c-designware-*" https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5186_22/fi-bsw-cyan/igt@gem_close_race@basic-threads.html <3> [52.827781] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request. Created attachment 142704 [details]
acpidump
Here is the output of acpidump CONFIG_I2C_DESIGNWARE_BAYTRAIL=y built from drm-tip
Last occurrence in CI five days ago: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5268/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_152/fi-bsw-kefka/igt@gem_exec_parallel@render-contexts.html <3> [185.530019] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.593865] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.657705] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.717010] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.771112] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.827251] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.884902] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.935587] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [185.991215] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.054227] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.105089] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.156380] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.206660] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.257584] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request. <3> [186.308254] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request > We need to check whether the i2c bus is supposedly shared with the punit.
> I don't see the relevant info msg in the dmesg, so first make sure you have
> CONFIG_I2C_DESIGNWARE_BAYTRAIL=y in your kernel config, and then attach
> the output of acpidump.
Ville, this has been provided, did it provide any new info?
Last seen in CI a month and a week ago. Jarred Kenny, would you mind giving another try with the latest drm-tip? Meanwhile I'm setting the priority to high given the reproducibility rate. Created attachment 143310 [details]
attachment-4149-0.html
sorry, i will replay your email until 11/2/2019, please call me cell phone if urgency. Sorry for the inconvenience it might bring to you.
(In reply to Francesco Balestrieri from comment #17) > Last seen in CI a month and a week ago. Jarred Kenny, would you mind giving > another try with the latest drm-tip? > > Meanwhile I'm setting the priority to high given the reproducibility rate. I've had 4 devices running 5.0.0-rc7/drm-tip today rebooting every minute. So far we've yet to see a single failure after ~400 boots. I will let this test run over the weekend, but things are looking better. Jarred Kenny, were you able to reproduce it again? (In reply to Francesco Balestrieri from comment #20) > Jarred Kenny, were you able to reproduce it again? This looks fixed now. Used to happen at least once every 10 runs, and now not seen for over 400 runs. Closing! The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.