Bug 98670 - [BAT] WARN_ON(dev_priv->gt.awake) during drv_module_reload_basic
Summary: [BAT] WARN_ON(dev_priv->gt.awake) during drv_module_reload_basic
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-10 09:28 UTC by Tvrtko Ursulin
Modified: 2017-07-24 22:40 UTC (History)
2 users (show)

See Also:
i915 platform: ALL
i915 features: GEM/Other


Attachments

Description Tvrtko Ursulin 2016-11-10 09:28:42 UTC
https://intel-gfx-ci.01.org/CI/Patchwork_2944/fi-bsw-n3050/igt@drv_module_reload_basic.html

[   63.113930] ------------[ cut here ]------------
[   63.113997] WARNING: CPU: 0 PID: 7252 at drivers/gpu/drm/i915/i915_gem.c:4482 i915_gem_suspend+0x1a9/0x1b0 [i915]
[   63.114408] WARN_ON(dev_priv->gt.awake)
[   63.114411] Modules linked in:
[   63.114413]  i915(-) snd_hda_codec_hdmi coretemp crct10dif_pclmul crc32_pclmul
[   63.114429]  ghash_clmulni_intel
[   63.114431]  snd_hda_codec_realtek snd_hda_codec_generic lpc_ich snd_hda_codec snd_hwdep snd_hda_core snd_pcm i2c_designware_platform i2c_designware_core r8169 mii sdhci_acpi sdhci i2c_hid mmc_core [last unloaded: snd_hda_intel]
[   63.114457] CPU: 0 PID: 7252 Comm: rmmod Not tainted 4.9.0-rc4-CI-Patchwork_2944+ #1
[   63.114460] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0055.2016.0812.1130 08/12/2016
[   63.114463]  ffffc90000573d08 ffffffff8142e095 ffffc90000573d58 0000000000000000
[   63.114470]  ffffc90000573d48 ffffffff8107e516 0000118200000000 ffff88016fd4f8a0
[   63.114476]  ffff88016fd4f8a0 ffff88016fd48000 ffff88016fd48068 0000556b70d34200
[   63.114483] Call Trace:
[   63.114491]  [<ffffffff8142e095>] dump_stack+0x67/0x92
[   63.114496]  [<ffffffff8107e516>] __warn+0xc6/0xe0
[   63.114499]  [<ffffffff8107e57a>] warn_slowpath_fmt+0x4a/0x50
[   63.114537]  [<ffffffffa037c999>] i915_gem_suspend+0x1a9/0x1b0 [i915]
[   63.114569]  [<ffffffffa0338ede>] i915_driver_unload+0x1e/0x1a0 [i915]
[   63.114604]  [<ffffffffa0342a24>] i915_pci_remove+0x14/0x20 [i915]
[   63.114609]  [<ffffffff8147d694>] pci_device_remove+0x34/0xb0
[   63.114614]  [<ffffffff8157b81c>] __device_release_driver+0x9c/0x150
[   63.114617]  [<ffffffff8157c396>] driver_detach+0xb6/0xc0
[   63.114621]  [<ffffffff8157b2b3>] bus_remove_driver+0x53/0xd0
[   63.114624]  [<ffffffff8157ce57>] driver_unregister+0x27/0x50
[   63.114628]  [<ffffffff8147c6e5>] pci_unregister_driver+0x25/0x70
[   63.114666]  [<ffffffffa04003ec>] i915_exit+0x1a/0x71 [i915]
[   63.114671]  [<ffffffff811135f3>] SyS_delete_module+0x193/0x1e0
[   63.114677]  [<ffffffff8181722e>] entry_SYSCALL_64_fastpath+0x1c/0xb1
[   63.114680] ---[ end trace 25c606c49aee04a1 ]---
[   64.868229] Setting dangerous option inject_load_failure - tainting kernel
[   66.954555] Setting dangerous option inject_load_failure - tainting kernel
[   69.032090] Setting dangerous option inject_load_failure - tainting kernel
[   72.114686] Setting dangerous option inject_load_failure - tainting kernel
Comment 1 Chris Wilson 2016-11-18 12:36:44 UTC
Had a thought that this may be related to

commit 4302055b29cbc8566aaa5eb7f594ea9cc78ebf41
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 15 16:46:20 2016 +0000

    drm/i915: Be more careful to drop the GT wakeref
    
    Since we can retire requests from multiple paths, we cannot assume that
    i915_gem_retire_requests() is the sole path on which we can transition
    to gt.active_requests == 0. A consequence of this is that we would skip
    the function if we had already retired all the requests and not
    scheduled the idle worker.
    
    This is fallout from changing the routine from considering active_engines
    (for which it was the only consumer) to active_requests.

Optimistically closing to provoke CI tripping over it again.
Comment 2 Petri Latvala 2016-12-05 09:57:37 UTC
Reopening, happening on CI again. drv_module_reload@basic-reload-inject is triggering this fairly often.

Happened on KBL and ILK so far.

https://intel-gfx-ci.01.org/CI/CI_DRM_1908/fi-kbl-7500u/igt@drv_module_reload@basic-reload-inject.html
Comment 3 Chris Wilson 2016-12-05 10:10:14 UTC
Note also that debugobjects (now enabled) is picking up a still active timer:

[   31.231857] WARNING: CPU: 2 PID: 6400 at drivers/gpu/drm/i915/i915_gem.c:4239 i915_gem_suspend+0x181/0x190 [i915]
[   31.231859] WARN_ON(dev_priv->gt.awake)
[   31.231860] Modules linked in:
[   31.231862]  i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hwdep ghash_clmulni_intel snd_hda_core snd_pcm mei_me mei e1000e ptp pps_core i2c_hid [last unloaded: snd_hda_intel]
[   31.231885] CPU: 2 PID: 6400 Comm: drv_module_relo Tainted: G     U  W       4.9.0-rc8-CI-CI_DRM_1908+ #1
[   31.231887] Hardware name: GIGABYTE GB-BKi7A-7500/MFLP7AP-00, BIOS F1 07/27/2016
[   31.231888]  ffffc90000587d18 ffffffff8142e3d5 ffffc90000587d68 0000000000000000
[   31.231894]  ffffc90000587d58 ffffffff8107e4d6 0000108f00000000 ffff88025d9c0000
[   31.231901]  0000000000000000 ffff88025d9c0068 ffffffffa0147960 0000000000000000
[   31.231909] Call Trace:
[   31.231916]  [<ffffffff8142e3d5>] dump_stack+0x67/0x92
[   31.231922]  [<ffffffff8107e4d6>] __warn+0xc6/0xe0
[   31.231925]  [<ffffffff8107e53a>] warn_slowpath_fmt+0x4a/0x50
[   31.231956]  [<ffffffffa0059281>] i915_gem_suspend+0x181/0x190 [i915]
[   31.231977]  [<ffffffffa0016f2e>] i915_driver_unload+0x1e/0x190 [i915]
[   31.232000]  [<ffffffffa0020a44>] i915_pci_remove+0x14/0x20 [i915]
[   31.232003]  [<ffffffff8147da84>] pci_device_remove+0x34/0xb0
[   31.232008]  [<ffffffff8157c40c>] __device_release_driver+0x9c/0x150
[   31.232011]  [<ffffffff8157cf86>] driver_detach+0xb6/0xc0
[   31.232014]  [<ffffffff8157bea3>] bus_remove_driver+0x53/0xd0
[   31.232016]  [<ffffffff8157da47>] driver_unregister+0x27/0x50
[   31.232020]  [<ffffffff8147cad5>] pci_unregister_driver+0x25/0x70
[   31.232055]  [<ffffffffa01056f4>] i915_exit+0x1a/0x71 [i915]
[   31.232060]  [<ffffffff811134f3>] SyS_delete_module+0x193/0x1e0
[   31.232064]  [<ffffffff818192ae>] entry_SYSCALL_64_fastpath+0x1c/0xb1
[   31.232067] ---[ end trace 59f0c2ad2c693406 ]---
[   31.281573] ------------[ cut here ]------------
[   31.281581] WARNING: CPU: 3 PID: 6400 at lib/debugobjects.c:263 debug_print_object+0x87/0xb0
[   31.281587] ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
[   31.281598] Modules linked in: i915(-) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hwdep ghash_clmulni_intel snd_hda_core snd_pcm mei_me mei e1000e ptp pps_core i2c_hid [last unloaded: snd_hda_intel]
[   31.281622] CPU: 3 PID: 6400 Comm: drv_module_relo Tainted: G     U  W       4.9.0-rc8-CI-CI_DRM_1908+ #1
[   31.281624] Hardware name: GIGABYTE GB-BKi7A-7500/MFLP7AP-00, BIOS F1 07/27/2016
[   31.281625]  ffffc90000587be8 ffffffff8142e3d5 ffffc90000587c38 0000000000000000
[   31.281630]  ffffc90000587c28 ffffffff8107e4d6 0000010700000086 ffff880258cf0bc8
[   31.281634]  ffffffff81e4ae40 ffffffff81c3d55a ffffc90000587d00 0000000000000004
[   31.281638] Call Trace:
[   31.281643]  [<ffffffff8142e3d5>] dump_stack+0x67/0x92
[   31.281647]  [<ffffffff8107e4d6>] __warn+0xc6/0xe0
[   31.281650]  [<ffffffff8107e53a>] warn_slowpath_fmt+0x4a/0x50
[   31.281652]  [<ffffffff8144dba7>] debug_print_object+0x87/0xb0
[   31.281656]  [<ffffffff8109b5a0>] ? queue_work_on+0x70/0x70
[   31.281658]  [<ffffffff8144eaf5>] debug_check_no_obj_freed+0x1c5/0x220
[   31.281663]  [<ffffffff811768ed>] __free_pages_ok+0x15d/0x630
[   31.281666]  [<ffffffff8144eb3a>] ? debug_check_no_obj_freed+0x20a/0x220
[   31.281669]  [<ffffffff81557841>] ? drm_dev_unref+0xc1/0xe0
[   31.281672]  [<ffffffff81178884>] __free_pages+0x14/0x20
[   31.281676]  [<ffffffff811d19b6>] kfree+0x246/0x2e0
[   31.281679]  [<ffffffff81557841>] drm_dev_unref+0xc1/0xe0
[   31.281705]  [<ffffffffa0020a4c>] i915_pci_remove+0x1c/0x20 [i915]
[   31.281709]  [<ffffffff8147da84>] pci_device_remove+0x34/0xb0
[   31.281713]  [<ffffffff8157c40c>] __device_release_driver+0x9c/0x150
[   31.281716]  [<ffffffff8157cf86>] driver_detach+0xb6/0xc0
[   31.281719]  [<ffffffff8157bea3>] bus_remove_driver+0x53/0xd0
[   31.281721]  [<ffffffff8157da47>] driver_unregister+0x27/0x50
[   31.281724]  [<ffffffff8147cad5>] pci_unregister_driver+0x25/0x70
[   31.281761]  [<ffffffffa01056f4>] i915_exit+0x1a/0x71 [i915]
[   31.281765]  [<ffffffff811134f3>] SyS_delete_module+0x193/0x1e0
[   31.281769]  [<ffffffff818192ae>] entry_SYSCALL_64_fastpath+0x1c/0xb1

The code currently does:

switch-to-kernel; wait-for-idle; retire-all; GEM_BUG_ON(!active_requests);

after which point we will have queued the idle_worker (queued when active_requests -> 0). Then we flush the idle work, which should complete the task, and certainly not leave the idle work pending. I don't see the hole, but that still active delayed work is very suspicious.
Comment 4 Chris Wilson 2016-12-05 10:17:00 UTC
Looks more likely that we are hitting one of the skip idle() checks, maybe

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9cc50247ad44..fa9b1b68f501 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4314,7 +4314,8 @@ int i915_gem_suspend(struct drm_i915_private *dev_priv)
        cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
        cancel_delayed_work_sync(&dev_priv->gt.retire_work);
        flush_delayed_work(&dev_priv->gt.idle_work);
-       flush_work(&dev_priv->mm.free_work);
+       while (flush_work(&dev_priv->mm.free_work))
+               ;
 
        /* Assert that we sucessfully flushed all the work and
         * reset the GPU back to its idle, low power state.

to really drain the work.
Comment 5 Chris Wilson 2016-12-05 10:17:53 UTC
s/free_work/idle_work/ ofc
Comment 6 Chris Wilson 2017-02-04 20:51:49 UTC
Been quiet for over a month, looks like the flush until it stops shaking is fine wallpaper.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.