Bug 108714

Summary: [CI][BAT] igt@gem_* - incomplete - *ERROR* render/media: timed out waiting for forcewake ack request.
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Chris Wilson <chris>
Status: RESOLVED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: 0.freedesktop, intel-gfx-bugs, jarred.kenny
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: BSW/CHT i915 features: GEM/Other
Attachments:
Description Flags
kernel log
none
dmesg-drmtip
none
acpidump
none
attachment-4149-0.html none

Description Martin Peres 2018-11-12 10:56:16 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5114/fi-bsw-n3050/igt@gem_ctx_create@basic-files.html

<6> [57.233981] [IGT] gem_ctx_create: executing
<5> [57.249621] Setting dangerous option reset - tainting kernel
<6> [57.262685] [IGT] gem_ctx_create: starting subtest basic-files
<3> [58.102484] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.166401] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.230473] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.281637] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.332217] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.382979] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.433659] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.484298] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.550682] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.614753] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.665995] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.717261] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.768567] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.819376] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.871196] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
<3> [58.934863] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
Comment 1 Chris Wilson 2018-11-12 11:08:30 UTC
That's a first for bsw. I wouldn't be surprised if it coincided with a GPU hang. We will never know!
Comment 2 Francesco Balestrieri 2018-11-20 08:49:19 UTC
Last happened 12 hours ago. What's needed to investigate further?
Comment 3 Francesco Balestrieri 2018-11-23 10:51:55 UTC
Trying to reproduce locally.
Comment 4 Jarred Kenny 2018-11-27 19:48:30 UTC
Created attachment 142633 [details]
kernel log
Comment 5 Jarred Kenny 2018-11-27 19:48:42 UTC
I have been attempting to track down this issue for some time now. We run into this issue on our devices every ~100 of so boots. 

When this occurs, the system is locked up entirely and does not respond to any keyboard input. The logs are also missing from the system's journal on the following boot. 

I was able to capture the attached log over a serial console, with drm.debug=0xfe in my kernel line. 

We also have hundreds of these devices and I've been reproducing this issue by rebooting 12 of them every minute via cron. With 12 of them rebooting every minute, we generally see 4-5 failures a day.

I would be happy to apply and test kernel patches on our test bed of devices if that is useful.
Comment 6 Chris Wilson 2018-11-27 19:54:52 UTC
(In reply to Jarred Kenny from comment #4)
> Created attachment 142633 [details]
> kernel log

No GPU hang. Earliest errors are

[    8.446640] [drm:vlv_set_power_well] *ERROR* timeout setting power well state 00000000 (fcfff3ff)
[    8.460049] [drm:chv_dpio_cmn_power_well_enable] *ERROR* Display PHY 0 is not power up
[    8.472407] ------------[ cut here ]------------
[    8.484644] DPIO read pipe A reg 0x8170 == 0xffffffff
[    8.496940] WARNING: CPU: 2 PID: 69 at /home/tj/build/linux-4.18.6/drivers/gpu/drm/i915/intel_sideband.c:193 vlv_dpio_read+0x75/0x90
[    8.509414] Modules linked in:
[    8.521760] CPU: 2 PID: 69 Comm: kworker/2:1 Tainted: G        W         4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
[    8.534306] Hardware name: Intel Corporation STK1A32SC/STK1A32SC, BIOS SCCHTAX5.86A.0041.2018.0620.1803 06/20/2018
[    8.546914] Workqueue: events output_poll_execute
[    8.559417] RIP: 0010:vlv_dpio_read+0x75/0x90
[    8.571805] Code: 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 75 22 48 83 c4 10 5b 5d c3 8d 73 41 89 c1 89 ea 48 c7 c7 a0 d7 9d ac e8 6b 8d a0 ff <0f> 0b 8b 44 24 04 eb ce e8 fe 8f a0 ff 0f 1f 40 00 66 2e 0f 1f 84
[    8.585002] RSP: 0000:ffffa8228063bb88 EFLAGS: 00010286
[    8.597944] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffacc59d08
[    8.610873] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246
[    8.623696] RBP: 0000000000008170 R08: 0000000000000000 R09: 0000000000000639
[    8.636388] R10: 0000000000001af0 R11: 0000000000000000 R12: 0000000000000000
[    8.648938] R13: ffffffffacce3940 R14: ffff9cc137454918 R15: 0000000000000000
[    8.661401] FS:  0000000000000000(0000) GS:ffff9cc17ad00000(0000) knlGS:0000000000000000
[    8.673860] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.686224] CR2: ffffa8228045c000 CR3: 000000007620a000 CR4: 00000000001006e0
[    8.698578] Call Trace:
[    8.710792]  ? _cond_resched+0x16/0x40
[    8.722928]  chv_dpio_cmn_power_well_enable+0xa4/0x1f0
[    8.734994]  intel_power_well_enable+0x34/0x40
[    8.746964]  __intel_display_power_get_domain+0x7c/0x90
[    8.758881]  intel_display_power_get+0x2e/0x40
[    8.770701]  intel_dp_detect+0x11c/0x5d0
[    8.782449]  drm_helper_probe_single_connector_modes+0xd9/0x6f0
[    8.794149]  ? drm_setup_crtcs+0x3d/0xa80
[    8.805708]  ? drm_dbg+0x84/0x90
[    8.817121]  drm_setup_crtcs+0x13c/0xa80
[    8.828489]  ? drm_dbg+0x84/0x90
[    8.839698]  ? drm_fb_helper_is_bound.isra.16+0x73/0xa0
[    8.850851]  drm_fb_helper_hotplug_event.part.30+0x74/0xb0
[    8.861950]  output_poll_execute+0xa9/0x1a0
[    8.872910]  process_one_work+0x191/0x370
[    8.883731]  worker_thread+0x204/0x3b0
[    8.894428]  kthread+0xf8/0x130
[    8.905118]  ? rescuer_thread+0x340/0x340
[    8.915716]  ? kthread_create_worker_on_cpu+0x70/0x70
[    8.926283]  ret_from_fork+0x35/0x40
[    8.936749] ---[ end trace 7f736234f2558a0d ]---
[    8.947140] ------------[ cut here ]------------
[    8.948334] systemd[1]: Listening on udev Kernel Socket.
[    8.957469] DPIO read pipe A reg 0x8098 == 0xffffffff
[    8.974893] WARNING: CPU: 2 PID: 69 at /home/tj/build/linux-4.18.6/drivers/gpu/drm/i915/intel_sideband.c:193 vlv_dpio_read+0x75/0x90
[[0;32m  OK  [0m] Listening on udev Kernel Socket.
[    8.985356] Modules linked in:
[    8.995827] CPU: 2 PID: 69 Comm: kworker/2:1 Tainted: G        W         4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
[    9.006485] Hardware name: Intel Corporation STK1A32SC/STK1A32SC, BIOS SCCHTAX5.86A.0041.2018.0620.1803 06/20/2018
[    9.017166] Workqueue: events output_poll_execute
[    9.027724] RIP: 0010:vlv_dpio_read+0x75/0x90
[    9.038146] Code: 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 75 22 48 83 c4 10 5b 5d c3 8d 73 41 89 c1 89 ea 48 c7 c7 a0 d7 9d ac e8 6b 8d a0 ff <0f> 0b 8b 44 24 04 eb ce e8 fe 8f a0 ff 0f 1f 40 00 66 2e 0f 1f 84
[    9.049277] RSP: 0000:ffffa8228063bb88 EFLAGS: 00010286
[    9.060121] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    9.071048] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246
[    9.081807] RBP: 0000000000008098 R08: 0000000000000000 R09: 0000000000000002
[    9.092404] R10: 0000000000001af0 R11: 0000000000000000 R12: 0000000000000000
[    9.102856] R13: ffffffffacce3940 R14: ffff9cc137454918 R15: 0000000000000000
[    9.113185] FS:  0000000000000000(0000) GS:ffff9cc17ad00000(0000) knlGS:0000000000000000
[    9.123494] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.133682] CR2: ffffa8228045c000 CR3: 000000007620a000 CR4: 00000000001006e0
[    9.143822] Call Trace:
[    9.153861]  chv_dpio_cmn_power_well_enable+0x1cb/0x1f0
[    9.163993]  intel_power_well_enable+0x34/0x40
[    9.174077]  __intel_display_power_get_domain+0x7c/0x90
[    9.184140]  intel_display_power_get+0x2e/0x40
[    9.194177]  intel_dp_detect+0x11c/0x5d0
[    9.204091]  drm_helper_probe_single_connector_modes+0xd9/0x6f0
[    9.214018]  ? drm_setup_crtcs+0x3d/0xa80
[    9.223821]  ? drm_dbg+0x84/0x90
[    9.233570]  drm_setup_crtcs+0x13c/0xa80
[    9.243134]  ? drm_dbg+0x84/0x90
[    9.252478]  ? drm_fb_helper_is_bound.isra.16+0x73/0xa0
[    9.261757]  drm_fb_helper_hotplug_event.part.30+0x74/0xb0
[    9.270909]  output_poll_execute+0xa9/0x1a0
[    9.279901]  process_one_work+0x191/0x370
[    9.288873]  worker_thread+0x204/0x3b0
[    9.297512]  kthread+0xf8/0x130
[    9.305772]  ? rescuer_thread+0x340/0x340
[    9.313997]  ? kthread_create_worker_on_cpu+0x70/0x70
[    9.322277]  ret_from_fork+0x35/0x40
[    9.330554] ---[ end trace 7f736234f2558a0e ]---

Being powerwell related, very suspicious.

First thing is to repeat with https://cgit.freedesktop.org/drm/drm-tip
Comment 7 Chris Wilson 2018-11-28 09:25:16 UTC
Fwiw, I was able to reliably hit media forcewake errors by forcing the LRI path for context pdp updates (and running gem_concurrent_blit). Could just be a coincidence.
Comment 8 Jarred Kenny 2018-11-28 12:25:52 UTC
Created attachment 142642 [details]
dmesg-drmtip
Comment 9 Jarred Kenny 2018-11-28 12:26:57 UTC
I ran the same 1-minute reboot test on one of my devices running 4.20-rc4/drm-tip overnight and was able to capture the same failure via serial logging.

I've attached the kernel log capturing the failure above.
Comment 10 Ville Syrjala 2018-11-29 14:35:39 UTC
The punit seems to die, which is not conducive to things actually working.

Imre noticed that it's happening suspiciously just after:
[    5.287531] i2c_designware 808622C1:06: timeout in disabling adapter

We need to check whether the i2c bus is supposedly shared with the punit. I don't see the relevant info msg in the dmesg, so first make sure you have CONFIG_I2C_DESIGNWARE_BAYTRAIL=y in your kernel config, and then attach the output of acpidump.
Comment 11 Tj 2018-11-29 15:43:10 UTC
It looks like you've picked up on the same thread I've been following. Here's some comments I left for Jarred earlier (in IRC) which I don't know that he has seen yet.

08:42 <TJ-> I've been working on the dmesg again and I'm looking to a wider cause. Communication with the GPU sub-systems is partially done using IIC (i2c). The root cause seems to start with "[    5.102322] i2c_designware 808622C1:06: controller timed out". 808622C1 is the designware CHERRY_TRAIL and is supposed to be configured by ACPI DSDT values
09:03 <TJ-> Might be getting closer... there was a patch to fix a very similar problem in fba4adbbf670 (which I suspect might help here based on the commit comment) but that got reverted/changed in 06cb616b1bca and this later change makes a global change but is instigated for a particular platform. I wonder if the first patch might be more correct on your devices.
09:24 <TJ-> OK, we need to capture what little debug messages the i2c-designware driver writes via dev_dbg() so add "debug" to the kernel command-line. This command will show you the messages we can expect to see: "git grep -A1 dev_dbg drivers/i2c/busses/i2c-designware-*"
Comment 12 Martin Peres 2018-12-03 16:35:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5186_22/fi-bsw-cyan/igt@gem_close_race@basic-threads.html

<3> [52.827781] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
Comment 13 Jarred Kenny 2018-12-03 18:32:05 UTC
Created attachment 142704 [details]
acpidump

Here is the output of acpidump CONFIG_I2C_DESIGNWARE_BAYTRAIL=y built from drm-tip
Comment 14 Francesco Balestrieri 2018-12-11 09:20:06 UTC
Last occurrence in CI five days ago:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5268/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html
Comment 15 Martin Peres 2018-12-12 16:51:21 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_152/fi-bsw-kefka/igt@gem_exec_parallel@render-contexts.html

<3> [185.530019] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.593865] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.657705] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.717010] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.771112] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.827251] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.884902] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.935587] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [185.991215] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.054227] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.105089] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.156380] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.206660] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.257584] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request.
<3> [186.308254] [drm:fw_domains_get [i915]] *ERROR* media: timed out waiting for forcewake ack request
Comment 16 Francesco Balestrieri 2019-01-08 09:22:25 UTC
> We need to check whether the i2c bus is supposedly shared with the punit. 
> I don't see the relevant info msg in the dmesg, so first make sure you have
> CONFIG_I2C_DESIGNWARE_BAYTRAIL=y in your kernel config, and then attach 
> the output of acpidump.

Ville, this has been provided, did it provide any new info?
Comment 17 Francesco Balestrieri 2019-02-06 07:15:44 UTC
Last seen in CI a month and a week ago. Jarred Kenny, would you mind giving another try with the latest drm-tip?

Meanwhile I'm setting the priority to high given the reproducibility rate.
Comment 18 Shuang He 2019-02-06 07:16:00 UTC
Created attachment 143310 [details]
attachment-4149-0.html

sorry,  i will replay your email until 11/2/2019, please call me cell phone if urgency. Sorry for the inconvenience it might bring to you.
Comment 19 Jarred Kenny 2019-02-22 20:46:07 UTC
(In reply to Francesco Balestrieri from comment #17)
> Last seen in CI a month and a week ago. Jarred Kenny, would you mind giving
> another try with the latest drm-tip?
> 
> Meanwhile I'm setting the priority to high given the reproducibility rate.

I've had 4 devices running 5.0.0-rc7/drm-tip today rebooting every minute. So far we've yet to see a single failure after ~400 boots. I will let this test run over the weekend, but things are looking better.
Comment 20 Francesco Balestrieri 2019-03-01 11:59:46 UTC
Jarred Kenny, were you able to reproduce it again?
Comment 21 Martin Peres 2019-03-08 17:03:35 UTC
(In reply to Francesco Balestrieri from comment #20)
> Jarred Kenny, were you able to reproduce it again?

This looks fixed now. Used to happen at least once every 10 runs, and now not seen for over 400 runs. Closing!
Comment 22 CI Bug Log 2019-03-08 17:09:08 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.