Summary:       [BAT][BYT] kms_pipe_crc_basic@suspend-read-crc-pipe-a *ERROR* dp_aux_ch not done status 0x00050000
Product:       DRI
Component:     DRM/Intel
Version:       DRI git
Hardware:      Other
OS:            All
Status:        CLOSED FIXED
Severity:      blocker
Priority:      highest
Reporter:      Tvrtko Ursulin <tvrtko.ursulin>
Assignee:      Lyude Paul <lyude>
QA Contact:    Intel GFX Bugs mailing list <intel-gfx-bugs>
CC:            intel-gfx-bugs, joonas.lahtinen, petri.latvala
i915 platform: BYT
i915 features: display/DP
Description
Tvrtko Ursulin 2016-10-04 08:36:45 UTC

Same error in kms_pipe_crc_basic@suspend-read-crc-pipe-b

*** Bug 98095 has been marked as a duplicate of this bug. ***

Ville, you seem to be working on this. Assigned to you.

Created attachment 127097 [details]
dmesg with backtrace

So the problem is that output polling keeps running even though we've suspended the device. Broken by Lyude's poll rework, I believe.
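
For context on why this is nasty: the output polling is a self-rearming delayed work, so it stays alive until something explicitly cancels it. A minimal sketch of the pattern, with hypothetical names (this is not the actual drm_probe_helper code):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct poll_state {
	struct delayed_work work;
};

/* Hypothetical stand-in for the real connector probing, which talks to
 * the hardware (e.g. DP AUX transfers). */
static void probe_connectors(struct poll_state *state)
{
}

static void hotplug_poll_fn(struct work_struct *work)
{
	struct poll_state *state =
		container_of(to_delayed_work(work), struct poll_state, work);

	probe_connectors(state);

	/* Re-arm ourselves. If nothing cancels this before suspend, the
	 * next cycle runs against powered-down hardware and trips errors
	 * like "dp_aux_ch not done". */
	schedule_delayed_work(&state->work, msecs_to_jiffies(10 * 1000));
}

/* On suspend it is not enough to stop queueing new work; anything that
 * could re-enable polling (such as the hpd init work) has to be shut
 * down first, and then the pending work cancelled synchronously. */
static void poll_suspend(struct poll_state *state)
{
	cancel_delayed_work_sync(&state->work);
}
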
One hole is that we disable drm_kms_helper polling before we turn off the hpd init_work - that is, the init_work can fire and re-enable polling during suspend.

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 89d322215c84..7bb08c503f75 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1416,8 +1416,6 @@ static int i915_drm_suspend(struct drm_device *dev)
 	 * properly.
 	 */
 	intel_display_set_init_power(dev_priv, true);
-	drm_kms_helper_poll_disable(dev);
-
 	pci_save_state(pdev);
 
 	error = i915_gem_suspend(dev);
@@ -1430,11 +1428,12 @@ static int i915_drm_suspend(struct drm_device *dev)
 
 	intel_guc_suspend(dev);
 
 	intel_display_suspend(dev);
 
-	intel_dp_mst_suspend(dev);
 
 	intel_runtime_pm_disable_interrupts(dev_priv);
+	intel_hpd_cancel_work(dev_priv);
+	drm_kms_helper_poll_disable(dev);
 
 	intel_suspend_encoders(dev_priv);

Reference to patch submitted by Chris: https://patchwork.freedesktop.org/series/13447/

With that set I still see:

Command: /opt/igt/tests/kms_pipe_crc_basic --run-subtest suspend-read-crc-pipe-B

dmesg:

[  642.820104] Suspending console(s) (use no_console_suspend to debug)
[  642.824586] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  642.825255] sd 0:0:0:0: [sda] Stopping disk
[  642.930597] Broke affinity for irq 87
[  642.930604] Broke affinity for irq 88
[  642.950043] Broke affinity for irq 87
[  642.950050] Broke affinity for irq 88
[  642.950056] Broke affinity for irq 90
[  642.964542] Broke affinity for irq 9
[  642.964562] Broke affinity for irq 87
[  642.964567] Broke affinity for irq 88
[  642.964572] Broke affinity for irq 89
[  642.964577] Broke affinity for irq 90
[  642.977378] cache: parent cpu1 should not be sleeping
[  642.985264] [drm:intel_dp_aux_ch [i915]] *ERROR* dp_aux_ch not done status 0x00050000
[  642.988755] cache: parent cpu2 should not be sleeping
[  642.998519] cache: parent cpu3 should not be sleeping
[  643.070096] sd 0:0:0:0: [sda] Starting disk

Please try with:

diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index f6b64d7d3528..8356692a0e97 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -131,6 +131,8 @@ void drm_kms_helper_poll_enable_locked(struct drm_device *dev)
 	struct drm_connector *connector;
 	unsigned long delay = DRM_OUTPUT_POLL_PERIOD;
 
+	WARN(1, "Hello, who woke me?");
+
 	WARN_ON(!mutex_is_locked(&dev->mode_config.mutex));
 
 	if (!dev->mode_config.poll_enabled || !drm_kms_helper_poll)

Created attachment 127209 [details]
dmesg
Both patches applied.
Don't know which stack trace is relevant. Here's before the dp_aux_ch error:
[ 2206.737814] Hello, who woke me?
[ 2206.737841] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm lpc_ich i2c_designware_platform i2c_designware_core i915 r8169 mii i2c_hid sdhci_acpi sdhci mmc_core
[ 2206.737847] CPU: 0 PID: 233 Comm: kworker/0:2 Tainted: G W 4.8.0-pl-manual+ #2
[ 2206.737849] Hardware name: GIGABYTE GB-BXBT-1900/MZBAYAB-00, BIOS F6 02/17/2015
[ 2206.737903] Workqueue: events i915_hpd_poll_init_work [i915]
[ 2206.737911] 0000000000000000 ffff88013ababd40 ffffffff813ca25d ffff88013ababd90
[ 2206.737916] 0000000000000000 ffff88013ababd80 ffffffff8107c2b6 0000008600000000
[ 2206.737922] ffff8801324e0000 ffff8801324e0230 ffff8801324e8d90 0000000000000000
[ 2206.737923] Call Trace:
[ 2206.737931] [<ffffffff813ca25d>] dump_stack+0x4f/0x72
[ 2206.737936] [<ffffffff8107c2b6>] __warn+0xc6/0xe0
[ 2206.737941] [<ffffffff8107c31a>] warn_slowpath_fmt+0x4a/0x50
[ 2206.737946] [<ffffffff8109faad>] ? finish_task_switch+0x7d/0x1d0
[ 2206.737951] [<ffffffff814ce5f0>] drm_kms_helper_poll_enable_locked+0x20/0x100
[ 2206.738004] [<ffffffffa014eb38>] i915_hpd_poll_init_work+0xb8/0xd0 [i915]
[ 2206.738010] [<ffffffff81093b6c>] process_one_work+0x14c/0x480
[ 2206.738014] [<ffffffff81093ee3>] worker_thread+0x43/0x4e0
[ 2206.738019] [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2206.738023] [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2206.738027] [<ffffffff810994e4>] kthread+0xc4/0xe0
[ 2206.738033] [<ffffffff8176bdb9>] ? _raw_spin_unlock_irq+0x9/0x10
[ 2206.738038] [<ffffffff8176c3bf>] ret_from_fork+0x1f/0x40
[ 2206.738042] [<ffffffff81099420>] ? kthread_worker_fn+0x180/0x180
[ 2206.738045] ---[ end trace ec3ff52ee7faea67 ]---
And after:
[ 2209.156626] Hello, who woke me?
[ 2209.156655] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm lpc_ich i2c_designware_platform i2c_designware_core i915 r8169 mii i2c_hid sdhci_acpi sdhci mmc_core
[ 2209.156661] CPU: 3 PID: 10932 Comm: kworker/u8:31 Tainted: G W 4.8.0-pl-manual+ #2
[ 2209.156663] Hardware name: GIGABYTE GB-BXBT-1900/MZBAYAB-00, BIOS F6 02/17/2015
[ 2209.156671] Workqueue: events_unbound async_run_entry_fn
[ 2209.156678] 0000000000000000 ffff88013499bc38 ffffffff813ca25d ffff88013499bc88
[ 2209.156684] 0000000000000000 ffff88013499bc78 ffffffff8107c2b6 00000086810d8d21
[ 2209.156689] ffff8801324e0000 ffff8801324e0000 ffffffffa017e3c0 ffffffff81412490
[ 2209.156691] Call Trace:
[ 2209.156698] [<ffffffff813ca25d>] dump_stack+0x4f/0x72
[ 2209.156703] [<ffffffff8107c2b6>] __warn+0xc6/0xe0
[ 2209.156710] [<ffffffff81412490>] ? pci_pm_resume_noirq+0xa0/0xa0
[ 2209.156714] [<ffffffff8107c31a>] warn_slowpath_fmt+0x4a/0x50
[ 2209.156720] [<ffffffff8109328e>] ? __queue_delayed_work+0x7e/0x180
[ 2209.156725] [<ffffffff814ce5f0>] drm_kms_helper_poll_enable_locked+0x20/0x100
[ 2209.156729] [<ffffffff814cebe1>] drm_kms_helper_poll_enable+0x21/0x30
[ 2209.156776] [<ffffffffa00c3781>] i915_drm_resume+0x131/0x180 [i915]
[ 2209.156821] [<ffffffffa00c37e9>] i915_pm_restore+0x19/0x20 [i915]
[ 2209.156866] [<ffffffffa00c3809>] i915_pm_resume+0x9/0x10 [i915]
[ 2209.156871] [<ffffffff81412501>] pci_pm_resume+0x71/0xe0
[ 2209.156876] [<ffffffff815151fb>] dpm_run_callback+0x3b/0x140
[ 2209.156881] [<ffffffff815157f0>] device_resume+0x90/0x250
[ 2209.156885] [<ffffffff815159c8>] async_resume+0x18/0x40
[ 2209.156890] [<ffffffff8109b872>] async_run_entry_fn+0x32/0x140
[ 2209.156894] [<ffffffff81093b6c>] process_one_work+0x14c/0x480
[ 2209.156899] [<ffffffff81093ee3>] worker_thread+0x43/0x4e0
[ 2209.156903] [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2209.156907] [<ffffffff810994e4>] kthread+0xc4/0xe0
[ 2209.156913] [<ffffffff8176bdb9>] ? _raw_spin_unlock_irq+0x9/0x10
[ 2209.156918] [<ffffffff8176c3bf>] ret_from_fork+0x1f/0x40
[ 2209.156922] [<ffffffff81099420>] ? kthread_worker_fn+0x180/0x180
[ 2209.156925] ---[ end trace ec3ff52ee7faea68 ]---
[ 2209.157009] PM: resume of devices complete after 2301.212 msecs
Assigning to Lyude.

So for whatever reason I'm having no luck with getting this bug to show itself even once on this machine. I'm probably going to look over the code a few times and see if I can guess what fixes it by sending a couple of patches to the trybot, but it would be a lot easier if I could actually see it here…

So, I have one theory as to why I'm not seeing this with my NUC: the machines I've seen logs from on this bug report have multiple cores, while mine only has a single core. I'm thinking it's possible not having multiple cores here means there's almost no chance of the hotplug polling running at the same time as any of the other resume tasks in the kernel. I'm trying to get my hands on another NUC with more than one processor, but for the time being would one of you be able to run this test with "maxcpus=1" added to the kernel cmdline and see if the problem still happens?

(In reply to Lyude Paul from comment #11)
> [...] would one of you be able to run this test with "maxcpus=1" added to
> the kernel cmdline and see if the problem still happens?

I've been running the test repeatedly for about an hour now with that cmdline and the problem does not reproduce indeed.

(In reply to Petri Latvala from comment #12)
> I've been running the test repeatedly for about an hour now with that
> cmdline and the problem does not reproduce indeed.

Spoke too soon. Reproduced eventually.

(In reply to Lyude Paul from comment #11)
> [...] I'm trying to get my hands on another NUC with more than one
> processor, but for the time being would one of you be able to run this
> test with "maxcpus=1" added to the kernel cmdline and see if the problem
> still happens?

You could always try whacking a big msleep() somewhere appropriate.

(In reply to Petri Latvala from comment #13)
> Spoke too soon. Reproduced eventually.

Well, judging from the comment times, maxcpus=1 made it take 3 hours to hit the issue again, so that's still pretty useful :). So it looks like I was getting a little confused by the "Broke affinity for irq" errors and may have been seeing this bug all along on this NUC without realizing it.
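
For what it's worth, the msleep() suggestion above would amount to something like this - a throwaway debugging hack, with the placement inside i915_drm_suspend() being only a guess at "somewhere appropriate":

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ static int i915_drm_suspend(struct drm_device *dev)
 	intel_display_suspend(dev);
+
+	/* debug only: widen the race window so a stray poll/hpd worker
+	 * reliably fires while the device is suspending */
+	msleep(3000);
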
I managed to reproduce the hpd poll initialization worker getting called in the midst of late suspend, i.e. well after we're supposed to have disabled polling. The cause seems to be that we now deinitialize power wells in late suspend, which ends up causing us to call intel_hpd_poll_init(). So, just checking whether we're in suspend before initializing polling on power well deinit should fix this (hopefully).

Created attachment 127435 [details] [review]
Don't allow polling in late suspend

(In reply to Lyude Paul from comment #17)
> Created attachment 127435 [details] [review]
> Don't allow polling in late suspend

Repeated runs so far (for about an hour) with this patch haven't reproduced the error. I'll leave this running over the weekend.

(In reply to Petri Latvala from comment #18)
> Repeated runs so far (for about an hour) with this patch haven't reproduced
> the error. I'll leave this running over the weekend.

The error didn't appear at all over the weekend.

(In reply to Petri Latvala from comment #19)
> The error didn't appear at all over the weekend.

Thank you very much for testing this. I'll get the patch sent out to the mailing list and hopefully push it at the end of this week.

Thanks Lyude.

commit b64b540931483cca3200d98756bed6ad0e01d75c
Author: Lyude <lyude@redhat.com>
Date:   Wed Oct 26 12:36:09 2016 -0400

    drm/i915/vlv: Prevent enabling hpd polling in late suspend

Verified. No "dp_aux_ch not done status" failures on fi-byt-j1900 during the last 40+ CI_DRM runs. Latest dmesg visible on 1725 (2016-10-24). Closing verified+fixed.
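
For reference, the shape of the committed change, reconstructed from the commit title and Lyude's description above - the file and exact condition here are a best guess, so see the commit itself for the authoritative version:

diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c
--- a/drivers/gpu/drm/i915/intel_runtime_pm.c
+++ b/drivers/gpu/drm/i915/intel_runtime_pm.c
@@ static void vlv_display_power_well_deinit(struct drm_i915_private *dev_priv)
-	intel_hpd_poll_init(dev_priv);
+	/* Prevent us from re-enabling polling on suspend */
+	if (!dev_priv->drm.dev->power.is_suspended)
+		intel_hpd_poll_init(dev_priv);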