Bug 101627 - [HSW]drv_selftest/mock_requests - ida_remove called for id=1 which is not allocated.
Summary: [HSW]drv_selftest/mock_requests - ida_remove called for id=1 which is not all...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-28 14:33 UTC by mwa
Modified: 2017-07-21 16:45 UTC (History)
1 user (show)

See Also:
i915 platform: HSW
i915 features: GEM/Other


Attachments

Description mwa 2017-06-28 14:33:51 UTC
[  502.196556] [IGT] drv_selftest: starting subtest mock_requests
[  503.482723] Setting dangerous option mock_selftests - tainting kernel
[  503.523225] i915: Performing mock selftests with st_random_seed=0x1f3b31ec st_timeout=1000
[  504.560235] ida_remove called for id=1 which is not allocated.
[  504.560329] ------------[ cut here ]------------
[  504.560389] WARNING: CPU: 0 PID: 126 at lib/idr.c:383 ida_remove+0x1c0/0x1d0
[  504.560444] Modules linked in: i915(+) drm_kms_helper drm netconsole xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle iptable_raw iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ebtable_filter ebtables ip6table_filter ip6_tables snd_usb_audio snd_usbmidi_lib intel_rapl snd_rawmidi eeepc_wmi x86_pkg_temp_thermal asus_wmi sparse_keymap iTCO_wdt coretemp rfkill crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi gspca_zc3xx gspca_main i2c_i801
[  504.560762]  v4l2_common snd_hda_codec videodev snd_hwdep snd_hda_core media joydev snd_seq snd_seq_device snd_pcm mei_me snd_timer lpc_ich snd mei soundcore shpchp wmi tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc prime_numbers crc32c_intel i2c_algo_bit r8169 mii video [last unloaded: i915]
[  504.560923] CPU: 0 PID: 126 Comm: kworker/u8:4 Tainted: G    BU  W       4.12.0-rc7-drm-intel-debug+ #25
[  504.560995] Hardware name: ASUS All Series/Z87-K, BIOS 1401 07/29/2014
[  504.561157] Workqueue: mock contexts_free_worker [i915]
[  504.561214] task: ffff8801b52ea840 task.stack: ffff8801b5310000
[  504.561276] RIP: 0010:ida_remove+0x1c0/0x1d0
[  504.561330] RSP: 0018:ffff8801b5317ac8 EFLAGS: 00010082
[  504.561385] RAX: 0000000000000032 RBX: 1ffff10036a62f5b RCX: ffffffff9ba1208a
[  504.561446] RDX: 0000000000000000 RSI: ffff8801d7e17448 RDI: ffff8801d7e17448
[  504.561504] RBP: ffff8801b5317b80 R08: 0000000000000003 R09: 0000000000000000
[  504.561561] R10: ffff8801b5317af0 R11: ffffed0036a62e9d R12: ffff88016ca78418
[  504.561617] R13: 0000000000000202 R14: ffff88016ca78418 R15: 0000000000000001
[  504.561674] FS:  0000000000000000(0000) GS:ffff8801d7e00000(0000) knlGS:0000000000000000
[  504.561741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  504.561795] CR2: 00007f83089f1000 CR3: 000000012ee13000 CR4: 00000000001406f0
[  504.561853] Call Trace:
[  504.561907]  ? ida_get_new_above+0x570/0x570
[  504.561964]  ? do_raw_spin_lock+0x9c/0x140
[  504.562024]  ida_simple_remove+0x2b/0x50
[  504.562177]  i915_gem_context_free+0x365/0x5d0 [i915]
[  504.562336]  ? needs_pd_load_pre+0x140/0x140 [i915]
[  504.562398]  ? lock_acquire+0x14a/0x340
[  504.562555]  contexts_free+0x4a/0xb0 [i915]
[  504.562711]  contexts_free_worker+0x29/0x40 [i915]
[  504.562771]  process_one_work+0x581/0xe70
[  504.562832]  ? pwq_dec_nr_in_flight+0x200/0x200
[  504.562907]  ? lock_acquire+0x14a/0x340
[  504.562979]  ? lock_acquire+0x14a/0x340
[  504.563040]  ? do_raw_spin_lock+0x9c/0x140
[  504.563102]  worker_thread+0x396/0x9a0
[  504.563187]  ? process_one_work+0xe70/0xe70
[  504.563243]  ? trace_hardirqs_on_caller+0x18e/0x390
[  504.563307]  kthread+0x1fc/0x2f0
[  504.563362]  ? process_one_work+0xe70/0xe70
[  504.563420]  ? kthread_create_on_node+0xc0/0xc0
[  504.563481]  ret_from_fork+0x2a/0x40
[  504.563543] Code: 39 ff ff ff 4c 89 e7 e8 8f 65 b7 ff 49 8b 04 24 48 83 f8 02 0f 85 79 ff ff ff eb ba 44 89 fe 48 c7 c7 80 a9 83 9c e8 aa 9b a5 ff <0f> ff e9 61 ff ff ff e8 94 01 78 ff 0f 1f 40 00 55 48 b8 00 00
[  504.563711] ---[ end trace 23c7f67e1f726055 ]---
Comment 1 Chris Wilson 2017-06-28 15:13:39 UTC
The usual case for this was the worker running after i915_gem_context_fini(). The requirement there was that flush_workqueue() was run before i915_gem_context_fini().

i.e. we do

static void mock_device_release(struct drm_device *dev)
{
        struct drm_i915_private *i915 = to_i915(dev);
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        mutex_lock(&i915->drm.struct_mutex);
        mock_device_flush(i915);
        i915_gem_contexts_lost(i915);
        mutex_unlock(&i915->drm.struct_mutex);

        cancel_delayed_work_sync(&i915->gt.retire_work);
        cancel_delayed_work_sync(&i915->gt.idle_work);
        flush_workqueue(i915->wq);

        mutex_lock(&i915->drm.struct_mutex);
        for_each_engine(engine, i915, id)
                mock_engine_free(engine);
        i915_gem_contexts_fini(i915);
        mutex_unlock(&i915->drm.struct_mutex);

...

(I'm assuming that the llist is correct and we can't free the same context twice, kasan?)

Above though we have an issue in that we need to use drain_workqueue() instead. Otherwise, I can't see where we might have more requests to retire. Probably deserves a few bug ons to assert that all requests are retired and we are now idle, ala i915_gem_suspend(). Hmm, probably better to rework it to use i915_gem_suspend as well.
Comment 2 mwa 2017-06-29 22:17:40 UTC
Yeah i915_gem_suspend() seems to do the trick, though not sure what to do about the gt.awake and i915_gem_sanitize() business.
Comment 3 Chris Wilson 2017-07-19 12:26:08 UTC
commit 3b19f16a556446c144a1f921444931b0cf9447ab
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 18 14:41:24 2017 +0100

    drm/i915: Drain the device workqueue on unload
    
    Workers on the i915->wq may rearm themselves so for completeness we need
    to replace our flush_workqueue() with a call to drain_workqueue() before
    unloading the device.
    
    v2: Reinforce the drain_workqueue with an preceding rcu_barrier() as a
    few of the tasks that need to be drained may first be armed by RCU.

Should fix it in the short term. Reusing i915_gem_suspend() can hopefully be done at leisure.
Comment 4 mwa 2017-07-19 13:04:15 UTC
Hmm, still hitting the issue...
Comment 5 Chris Wilson 2017-07-19 13:20:26 UTC
Do we hit

diff --git a/drivers/gpu/drm/i915/selftests/mock_engine.c b/drivers/gpu/drm/i915/selftests/mock_engine.c
index 5b18a2dc19a8..df2cc9132d90 100644
--- a/drivers/gpu/drm/i915/selftests/mock_engine.c
+++ b/drivers/gpu/drm/i915/selftests/mock_engine.c
@@ -195,7 +195,7 @@ void mock_engine_free(struct intel_engine_cs *engine)
 
        GEM_BUG_ON(timer_pending(&mock->hw_delay));
 
-       if (engine->last_retired_context)
+       if (WARN_ON(engine->last_retired_context))
                engine->context_unpin(engine, engine->last_retired_context);
 
        intel_engine_fini_breadcrumbs(engine);

?

That should have been done by the i915_gem_contexts_lost() following the device flush.
Comment 6 mwa 2017-07-19 13:32:43 UTC
Yes.
Comment 7 Chris Wilson 2017-07-19 13:46:28 UTC
Ah, you don't have contexts_lost there because I added it later:

commit b8d0658849d52110c72b44860f86fb4c544de625
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 20 11:52:34 2017 +0100

    drm/i915: Keep a recent cache of freed contexts objects for reuse
    
    Keep the recently freed context objects for reuse. This allows us to use
    the current GGTT bindings and dma bound pages, avoiding any clflushes as
    required. We mark the objects as purgeable under memory pressure, and
    reap the list of freed objects as soon as the device is idle.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index d451dfbe9bbb..dda413c95b89 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -54,6 +54,7 @@ static void mock_device_release(struct drm_device *dev)
 
        mutex_lock(&i915->drm.struct_mutex);
        mock_device_flush(i915);
+       i915_gem_contexts_lost(i915);
        mutex_unlock(&i915->drm.struct_mutex);
 
        cancel_delayed_work_sync(&i915->gt.retire_work);


which explains why I haven't it...
Comment 8 Chris Wilson 2017-07-19 18:19:33 UTC
commit 56d27666f8fa21835724217b0c67d42b769b5723
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jul 19 14:59:57 2017 +0100

    drm/i915/selftests: Mark contexts as lost during freeing of mock device
    
    We need to unpin the last retired context early in the shutdown sequence
    so that its RCU free is done before we try to free the context ida. I
    included this in a later patch ("drm/i915: Keep a recent cache of freed
    contexts objects for reuse") and so missed that the selftests were broken
    in the meantime.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.