Bug 98040

Summary: [BAT][BYT] kms_pipe_crc_basic@suspend-read-crc-pipe-a *ERROR* dp_aux_ch not done status 0x00050000
Product: DRI Reporter: Tvrtko Ursulin <tvrtko.ursulin>
Component: DRM/IntelAssignee: Lyude Paul <lyude>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: highest CC: intel-gfx-bugs, joonas.lahtinen, petri.latvala
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: BYT i915 features: display/DP
Attachments:
Description Flags
dmesg wth backtrace
none
dmesg
none
Don't allow polling in late suspend none

Description Tvrtko Ursulin 2016-10-04 08:36:45 UTC
Happening since CI_DRM_1635 on fi-byt-j1900.

[  599.358492] Suspending console(s) (use no_console_suspend to debug)
[  599.362551] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  599.362917] sd 0:0:0:0: [sda] Stopping disk
[  599.475507] Broke affinity for irq 89
[  599.492502] Broke affinity for irq 87
[  599.492512] Broke affinity for irq 89
[  599.512013] Broke affinity for irq 9
[  599.512033] Broke affinity for irq 87
[  599.512039] Broke affinity for irq 88
[  599.512043] Broke affinity for irq 89
[  599.512048] Broke affinity for irq 90
[  599.525011]  cache: parent cpu1 should not be sleeping
[  599.532817] [drm:intel_dp_aux_ch [i915]] *ERROR* dp_aux_ch not done status 0x00050000
[  599.536168]  cache: parent cpu2 should not be sleeping
[  599.545918]  cache: parent cpu3 should not be sleeping
[  599.627223] sd 0:0:0:0: [sda] Starting disk
Comment 1 Jani Saarinen 2016-10-04 19:04:20 UTC
Same error in kms_pipe_crc_basic@suspend-read-crc-pipe-b
Comment 2 yann 2016-10-05 12:20:13 UTC
*** Bug 98095 has been marked as a duplicate of this bug. ***
Comment 3 Jani Saarinen 2016-10-06 19:29:15 UTC
Ville, you seem to be working on this. Assigned to you.
Comment 4 Ville Syrjala 2016-10-07 14:06:42 UTC
Created attachment 127097 [details]
dmesg wth backtrace

So the problem is that the output polling keeps on running while we've suspended the device. Broken by Lyude's poll stuff I believe.
Comment 5 Chris Wilson 2016-10-07 14:26:38 UTC
One hole is that we disable drm_kms_helper before we turn off the hpd init_work - that is the init_work can fire and re-enable the drm_kms_helper during suspend.

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 89d322215c84..7bb08c503f75 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1416,8 +1416,6 @@ static int i915_drm_suspend(struct drm_device *dev)
         * properly. */
        intel_display_set_init_power(dev_priv, true);
 
-       drm_kms_helper_poll_disable(dev);
-
        pci_save_state(pdev);
 
        error = i915_gem_suspend(dev);
@@ -1430,11 +1428,12 @@ static int i915_drm_suspend(struct drm_device *dev)
        intel_guc_suspend(dev);
 
        intel_display_suspend(dev);
-
        intel_dp_mst_suspend(dev);
 
        intel_runtime_pm_disable_interrupts(dev_priv);
+
        intel_hpd_cancel_work(dev_priv);
+       drm_kms_helper_poll_disable(dev);
 
        intel_suspend_encoders(dev_priv);
Comment 6 yann 2016-10-07 15:53:57 UTC
Reference to patch submitted by Chris: https://patchwork.freedesktop.org/series/13447/
Comment 7 Jani Saarinen 2016-10-10 11:13:33 UTC
Wuth that set still see:
Command	/opt/igt/tests/kms_pipe_crc_basic --run-subtest suspend-read-crc-pipe-B

dmesg	
[  642.820104] Suspending console(s) (use no_console_suspend to debug)
[  642.824586] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  642.825255] sd 0:0:0:0: [sda] Stopping disk
[  642.930597] Broke affinity for irq 87
[  642.930604] Broke affinity for irq 88
[  642.950043] Broke affinity for irq 87
[  642.950050] Broke affinity for irq 88
[  642.950056] Broke affinity for irq 90
[  642.964542] Broke affinity for irq 9
[  642.964562] Broke affinity for irq 87
[  642.964567] Broke affinity for irq 88
[  642.964572] Broke affinity for irq 89
[  642.964577] Broke affinity for irq 90
[  642.977378]  cache: parent cpu1 should not be sleeping
[  642.985264] [drm:intel_dp_aux_ch [i915]] *ERROR* dp_aux_ch not done status 0x00050000
[  642.988755]  cache: parent cpu2 should not be sleeping
[  642.998519]  cache: parent cpu3 should not be sleeping
[  643.070096] sd 0:0:0:0: [sda] Starting disk
Comment 8 Chris Wilson 2016-10-10 12:13:08 UTC
Please try with

diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index f6b64d7d3528..8356692a0e97 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -131,6 +131,8 @@ void drm_kms_helper_poll_enable_locked(struct drm_device *dev)
        struct drm_connector *connector;
        unsigned long delay = DRM_OUTPUT_POLL_PERIOD;
 
+       WARN(1, "Hello, who woke me?");
+
        WARN_ON(!mutex_is_locked(&dev->mode_config.mutex));
 
        if (!dev->mode_config.poll_enabled || !drm_kms_helper_poll)
Comment 9 Petri Latvala 2016-10-11 11:42:53 UTC
Created attachment 127209 [details]
dmesg

Both patches applied.

Don't know which stack trace is relevant. Here's before the dp_aux_ch error:

[ 2206.737814] Hello, who woke me?
[ 2206.737841] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core intel_power\
clamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm lpc_ich i2c_designware_platform i2c_designware_core i915 r8169 mii i2c_hid sdhci_acpi \
sdhci mmc_core
[ 2206.737847] CPU: 0 PID: 233 Comm: kworker/0:2 Tainted: G        W       4.8.0-pl-manual+ #2
[ 2206.737849] Hardware name: GIGABYTE GB-BXBT-1900/MZBAYAB-00, BIOS F6 02/17/2015
[ 2206.737903] Workqueue: events i915_hpd_poll_init_work [i915]
[ 2206.737911]  0000000000000000 ffff88013ababd40 ffffffff813ca25d ffff88013ababd90
[ 2206.737916]  0000000000000000 ffff88013ababd80 ffffffff8107c2b6 0000008600000000
[ 2206.737922]  ffff8801324e0000 ffff8801324e0230 ffff8801324e8d90 0000000000000000
[ 2206.737923] Call Trace:
[ 2206.737931]  [<ffffffff813ca25d>] dump_stack+0x4f/0x72
[ 2206.737936]  [<ffffffff8107c2b6>] __warn+0xc6/0xe0
[ 2206.737941]  [<ffffffff8107c31a>] warn_slowpath_fmt+0x4a/0x50
[ 2206.737946]  [<ffffffff8109faad>] ? finish_task_switch+0x7d/0x1d0
[ 2206.737951]  [<ffffffff814ce5f0>] drm_kms_helper_poll_enable_locked+0x20/0x100
[ 2206.738004]  [<ffffffffa014eb38>] i915_hpd_poll_init_work+0xb8/0xd0 [i915]
[ 2206.738010]  [<ffffffff81093b6c>] process_one_work+0x14c/0x480
[ 2206.738014]  [<ffffffff81093ee3>] worker_thread+0x43/0x4e0
[ 2206.738019]  [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2206.738023]  [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2206.738027]  [<ffffffff810994e4>] kthread+0xc4/0xe0
[ 2206.738033]  [<ffffffff8176bdb9>] ? _raw_spin_unlock_irq+0x9/0x10
[ 2206.738038]  [<ffffffff8176c3bf>] ret_from_fork+0x1f/0x40
[ 2206.738042]  [<ffffffff81099420>] ? kthread_worker_fn+0x180/0x180
[ 2206.738045] ---[ end trace ec3ff52ee7faea67 ]---

And after:

[ 2209.156626] Hello, who woke me?
[ 2209.156655] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core intel_power\
clamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm lpc_ich i2c_designware_platform i2c_designware_core i915 r8169 mii i2c_hid sdhci_acpi \
sdhci mmc_core
[ 2209.156661] CPU: 3 PID: 10932 Comm: kworker/u8:31 Tainted: G        W       4.8.0-pl-manual+ #2
[ 2209.156663] Hardware name: GIGABYTE GB-BXBT-1900/MZBAYAB-00, BIOS F6 02/17/2015
[ 2209.156671] Workqueue: events_unbound async_run_entry_fn
[ 2209.156678]  0000000000000000 ffff88013499bc38 ffffffff813ca25d ffff88013499bc88
[ 2209.156684]  0000000000000000 ffff88013499bc78 ffffffff8107c2b6 00000086810d8d21
[ 2209.156689]  ffff8801324e0000 ffff8801324e0000 ffffffffa017e3c0 ffffffff81412490
[ 2209.156691] Call Trace:
[ 2209.156698]  [<ffffffff813ca25d>] dump_stack+0x4f/0x72
[ 2209.156703]  [<ffffffff8107c2b6>] __warn+0xc6/0xe0
[ 2209.156710]  [<ffffffff81412490>] ? pci_pm_resume_noirq+0xa0/0xa0
[ 2209.156714]  [<ffffffff8107c31a>] warn_slowpath_fmt+0x4a/0x50
[ 2209.156720]  [<ffffffff8109328e>] ? __queue_delayed_work+0x7e/0x180
[ 2209.156725]  [<ffffffff814ce5f0>] drm_kms_helper_poll_enable_locked+0x20/0x100
[ 2209.156729]  [<ffffffff814cebe1>] drm_kms_helper_poll_enable+0x21/0x30
[ 2209.156776]  [<ffffffffa00c3781>] i915_drm_resume+0x131/0x180 [i915]
[ 2209.156821]  [<ffffffffa00c37e9>] i915_pm_restore+0x19/0x20 [i915]
[ 2209.156866]  [<ffffffffa00c3809>] i915_pm_resume+0x9/0x10 [i915]
[ 2209.156871]  [<ffffffff81412501>] pci_pm_resume+0x71/0xe0
[ 2209.156876]  [<ffffffff815151fb>] dpm_run_callback+0x3b/0x140
[ 2209.156881]  [<ffffffff815157f0>] device_resume+0x90/0x250
[ 2209.156885]  [<ffffffff815159c8>] async_resume+0x18/0x40
[ 2209.156890]  [<ffffffff8109b872>] async_run_entry_fn+0x32/0x140
[ 2209.156894]  [<ffffffff81093b6c>] process_one_work+0x14c/0x480
[ 2209.156899]  [<ffffffff81093ee3>] worker_thread+0x43/0x4e0
[ 2209.156903]  [<ffffffff81093ea0>] ? process_one_work+0x480/0x480
[ 2209.156907]  [<ffffffff810994e4>] kthread+0xc4/0xe0
[ 2209.156913]  [<ffffffff8176bdb9>] ? _raw_spin_unlock_irq+0x9/0x10
[ 2209.156918]  [<ffffffff8176c3bf>] ret_from_fork+0x1f/0x40
[ 2209.156922]  [<ffffffff81099420>] ? kthread_worker_fn+0x180/0x180
[ 2209.156925] ---[ end trace ec3ff52ee7faea68 ]---
[ 2209.157009] PM: resume of devices complete after 2301.212 msecs
Comment 10 Ville Syrjala 2016-10-12 10:38:27 UTC
Assigning to Lyude.
Comment 11 Lyude Paul 2016-10-13 20:03:53 UTC
So for whatever reason I'm having no luck with getting this bug to show itself even once on this machine. I'm probably going to look over the code a few times and see if I can guess what fixes it by sending a couple of patches to the trybot, but it would be a lot easier if I could actually see it here…

So, I have one theory as to why I'm not seeing this with my NUC: the machines I've seen logs from on this bug report have multiple cores, while mine only has a single core. I'm thinking it's possible not having multiple cores here means there's almost no chance of the hotplug polling running at the same time as any of the other resume tasks in the kernel. I'm trying to get my hands on another NUC with more then one processor, but for the time being would one of you be able to run this test with "maxcpus=1" added to the kernel cmdline and see if the problem still happens?
Comment 12 Petri Latvala 2016-10-14 10:43:01 UTC
(In reply to Lyude Paul from comment #11)
> So, I have one theory as to why I'm not seeing this with my NUC: the
> machines I've seen logs from on this bug report have multiple cores, while
> mine only has a single core. I'm thinking it's possible not having multiple
> cores here means there's almost no chance of the hotplug polling running at
> the same time as any of the other resume tasks in the kernel. I'm trying to
> get my hands on another NUC with more then one processor, but for the time
> being would one of you be able to run this test with "maxcpus=1" added to
> the kernel cmdline and see if the problem still happens?

I've been running the test repeatedly for about an hour now with that cmdline and the problem does not reproduce indeed.
Comment 13 Petri Latvala 2016-10-14 12:29:10 UTC
(In reply to Petri Latvala from comment #12)
> I've been running the test repeatedly for about an hour now with that
> cmdline and the problem does not reproduce indeed.

Spoke too soon. Reproduced eventually.
Comment 14 Ville Syrjala 2016-10-14 16:03:37 UTC
(In reply to Lyude Paul from comment #11)
> So for whatever reason I'm having no luck with getting this bug to show
> itself even once on this machine. I'm probably going to look over the code a
> few times and see if I can guess what fixes it by sending a couple of
> patches to the trybot, but it would be a lot easier if I could actually see
> it here…
> 
> So, I have one theory as to why I'm not seeing this with my NUC: the
> machines I've seen logs from on this bug report have multiple cores, while
> mine only has a single core. I'm thinking it's possible not having multiple
> cores here means there's almost no chance of the hotplug polling running at
> the same time as any of the other resume tasks in the kernel. I'm trying to
> get my hands on another NUC with more then one processor, but for the time
> being would one of you be able to run this test with "maxcpus=1" added to
> the kernel cmdline and see if the problem still happens?

You could always try whacking a big msleep() somewhere appropriate.
Comment 15 Lyude Paul 2016-10-14 17:28:17 UTC
(In reply to Petri Latvala from comment #13)
> (In reply to Petri Latvala from comment #12)
> > I've been running the test repeatedly for about an hour now with that
> > cmdline and the problem does not reproduce indeed.
> 
> Spoke too soon. Reproduced eventually.

Well judging from the comment times maxcpus=1 made it take 3 hours to hit the issue again, so that's still pretty useful :).
Comment 16 Lyude Paul 2016-10-20 19:48:45 UTC
So it looks like I was getting a little confused by the Broken irq affinity errors and may have been seeing this bug all along on this NUC without realizing it. I managed to reproduce the hpd poll initialization worker getting called in the midst of late suspend, which should be happening well after we've disabled polling. The cause seems to be that we now deinitialize power wells in late suspend, which ends up causing us to call intel_hpd_poll_init(). So, just checking if we're in suspend before initializing polling on power well deinit should fix this (hopefully).
Comment 17 Lyude Paul 2016-10-20 19:49:18 UTC
Created attachment 127435 [details] [review]
Don't allow polling in late suspend
Comment 18 Petri Latvala 2016-10-21 11:14:20 UTC
(In reply to Lyude Paul from comment #17)
> Created attachment 127435 [details] [review] [review]
> Don't allow polling in late suspend

Repeated runs so far (for about an hour) with this patch haven't reproduced the error. I'll leave this running over the weekend.
Comment 19 Petri Latvala 2016-10-24 09:38:21 UTC
(In reply to Petri Latvala from comment #18)
> (In reply to Lyude Paul from comment #17)
> > Created attachment 127435 [details] [review] [review] [review]
> > Don't allow polling in late suspend
> 
> Repeated runs so far (for about an hour) with this patch haven't reproduced
> the error. I'll leave this running over the weekend.

The error didn't appear at all over the weekend.
Comment 20 Lyude Paul 2016-10-24 15:06:09 UTC
(In reply to Petri Latvala from comment #19)
> (In reply to Petri Latvala from comment #18)
> > (In reply to Lyude Paul from comment #17)
> > > Created attachment 127435 [details] [review] [review] [review] [review]
> > > Don't allow polling in late suspend
> > 
> > Repeated runs so far (for about an hour) with this patch haven't reproduced
> > the error. I'll leave this running over the weekend.
> 
> The error didn't appear at all over the weekend.

Thank you very much for testing this. I'll get the patch sent out to the ML list and hopefully push it at the end of this week.
Comment 21 Jani Saarinen 2016-10-25 14:25:13 UTC
Thanks Lyude.
Comment 22 Jani Nikula 2016-10-29 12:15:49 UTC
commit b64b540931483cca3200d98756bed6ad0e01d75c
Author: Lyude <lyude@redhat.com>
Date:   Wed Oct 26 12:36:09 2016 -0400

    drm/i915/vlv: Prevent enabling hpd polling in late suspend
Comment 23 Jari Tahvanainen 2016-11-03 08:39:05 UTC
Verified. No failure "dp_aux_ch not done status" on fi-byt-j1900 during last +40 CI_DRM runs. Latest dmesg visible on 1725 (2016-10-24).
Comment 24 Jari Tahvanainen 2016-11-03 08:44:26 UTC
Closing verified+fixed.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.