Bug 76582 - igt/drv_module_reload causes call trace
Summary: igt/drv_module_reload causes call trace
Status: CLOSED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: medium normal
Assignee: Damien Lespiau
QA Contact: Intel GFX Bugs mailing list
Reported: 2014-03-25 05:48 UTC by lu hua
Modified: 2017-10-06 14:39 UTC

Attachments
dmesg (33.37 KB, text/plain)
2014-03-25 05:48 UTC, lu hua

Description lu hua 2014-03-25 05:48:52 UTC
Created attachment 96341
dmesg

System Environment:
--------------------------
Platform: Haswell
Kernel: drm-intel-fixes/0f4706d2740f2a221cd502922b22e522009041d9

Bug detailed description:
-----------------------------
The test causes a call trace on all platforms with the -nightly, -fixes and -queued kernels. Testing on an earlier commit shows the same issue.

Call trace:
[  200.944162] ------------[ cut here ]------------
[  200.944258] WARNING: CPU: 7 PID: 4296 at fs/sysfs/group.c:216 device_del+0x39/0x16a()
[  200.944406] sysfs group ffffffff81af86d0 not found for kobject 'i2c-6'
[  200.944500] Modules linked in: i915(-) drm_kms_helper drm ipv6 dm_mod snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi pcspkr serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support lpc_ich mfd_core snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore battery tpm_infineon tpm_tis tpm wmi acpi_cpufreq video button [last unloaded: snd_hda_intel]
[  200.945759] CPU: 7 PID: 4296 Comm: rmmod Tainted: G        W    3.14.0-rc5_drm-intel-fixes_0f4706_20140323+ #950
[  200.945906] Hardware name: ASUS All Series/Z87-EXPERT, BIOS 1008 05/17/2013
[  200.945998]  0000000000000000 0000000000000009 ffffffff81716b43 ffff880251395be8
[  200.946256]  ffffffff81035052 ffffffff00000001 ffffffff81379b9f ffffffff00000001
[  200.946514]  ffff880252747c00 0000000000000002 ffff88025269e150 0000000000000000
[  200.946773] Call Trace:
[  200.946861]  [<ffffffff81716b43>] ? dump_stack+0x41/0x51
[  200.946950]  [<ffffffff81035052>] ? warn_slowpath_common+0x73/0x8b
[  200.947040]  [<ffffffff81379b9f>] ? device_del+0x39/0x16a
[  200.947128]  [<ffffffff81035102>] ? warn_slowpath_fmt+0x45/0x4a
[  200.947217]  [<ffffffff81379b9f>] ? device_del+0x39/0x16a
[  200.947306]  [<ffffffff81379cd9>] ? device_unregister+0x9/0x12
[  200.947395]  [<ffffffff81379d16>] ? device_destroy+0x34/0x3a
[  200.947485]  [<ffffffff816212d6>] ? i2cdev_detach_adapter+0x3e/0x42
[  200.947574]  [<ffffffff8171ef87>] ? notifier_call_chain+0x2e/0x59
[  200.947665]  [<ffffffff8104e49b>] ? __blocking_notifier_call_chain+0x43/0x5d
[  200.947756]  [<ffffffff81379b97>] ? device_del+0x31/0x16a
[  200.947846]  [<ffffffff81379cd9>] ? device_unregister+0x9/0x12
[  200.947936]  [<ffffffff81620d27>] ? i2c_del_adapter+0x190/0x1d6
[  200.948032]  [<ffffffffa02e557a>] ? intel_dp_encoder_destroy+0x1d/0x62 [i915]
[  200.948126]  [<ffffffffa01d220e>] ? drm_mode_config_cleanup+0x2d/0x216 [drm]
[  200.948220]  [<ffffffffa01ceb54>] ? drm_sysfs_connector_remove+0x74/0x80 [drm]
[  200.948364]  [<ffffffffa02daf6f>] ? intel_modeset_cleanup+0xd1/0xe0 [i915]
[  200.948456]  [<ffffffffa02af96c>] ? i915_driver_unload+0xb6/0x2a0 [i915]
[  200.948548]  [<ffffffffa01cc127>] ? drm_dev_unregister+0x21/0x88 [drm]
[  200.948641]  [<ffffffffa01cc8fe>] ? drm_put_dev+0x48/0x51 [drm]
[  200.948730]  [<ffffffff812f8cee>] ? pci_device_remove+0x38/0x80
[  200.948819]  [<ffffffff8137c23d>] ? __device_release_driver+0x82/0xdb
[  200.948910]  [<ffffffff8137c914>] ? driver_detach+0x6e/0x9a
[  200.948998]  [<ffffffff8137c0a2>] ? bus_remove_driver+0x60/0x7e
[  200.949087]  [<ffffffff812f8e57>] ? pci_unregister_driver+0x17/0x75
[  200.949177]  [<ffffffffa01cdec9>] ? drm_pci_exit+0x3c/0xa0 [drm]
[  200.949268]  [<ffffffff8107e5c3>] ? SyS_delete_module+0x123/0x199
[  200.949358]  [<ffffffff8171c4b2>] ? page_fault+0x22/0x30
[  200.949447]  [<ffffffff817211a2>] ? system_call_fastpath+0x16/0x1b
[  200.949536] ---[ end trace 32f16d9b1d6381ab ]---


Reproduce steps:
----------------------------
1. ./drv_module_reload
Comment 1 Jani Nikula 2014-03-25 08:26:46 UTC
Please bisect.
Comment 2 Daniel Vetter 2014-03-26 18:44:14 UTC
Is this still an issue?

And I think you need to look harder, this was definitely working on older kernels. If it is still an issue, please analyze and provide a bisect.
Comment 3 lu hua 2014-03-27 07:30:33 UTC
8f6599da8e772fa8de54cdf98e9e03cbaf3946da is the first bad commit.
Same as https://bugs.freedesktop.org/show_bug.cgi?id=71208#c5

commit 8f6599da8e772fa8de54cdf98e9e03cbaf3946da
Author:     David Herrmann <dh.herrmann@gmail.com>
AuthorDate: Sun Oct 20 18:55:45 2013 +0200
Commit:     Dave Airlie <airlied@redhat.com>
CommitDate: Wed Nov 6 14:53:25 2013 +1000

    drm: delay minor destruction to drm_dev_free()

    Instead of freeing minors in drm_dev_unregister(), we only unplug them and
    delay the free to drm_dev_free(). Note that if drm_dev_register() has
    never been called, minors are NULL and this has no effect.

    This change is needed to allow early device unregistration. If we want to
    call drm_dev_unregister() on live devices, we need to guarantee that
    minors are still valid (but unplugged). This way, any open file can still
    access file_priv->minor->dev to get the DRM device. However, the minor is
    unplugged so no new users can occur.

    Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
Comment 4 Daniel Vetter 2014-03-27 08:30:07 UTC
Tbh I'm a bit confused. Adding David Herrmann for insight ...
Comment 5 David Herrmann 2014-03-27 12:08:50 UTC
I am confused by the bisect. The commit in question just delays a free(), so my first guess was that some i915 code checks for "dev->primary != NULL" and then does a deregistration, while it should rather check for "dev->primary->kdev != NULL". On the other hand, no-one should check for that at all and just expect them to be there during ->unload().

But then again, looking at the backtrace, we're currently in the i915 ->unload() path, so the code modified by the commit hasn't even been called yet. Furthermore, the warning happens _deep_ down the i2c chain (unregistering i2c-devices on top of i2c-adapters).

A few things that bug me:
 * this code is half a year old, why does this warning show up only _now_?
 * the bisected commit breaks module re-loading, but that was already fixed
 * the code in question hasn't even been called, yet

My suspicion is that you mixed up two different calltraces. Your comment "Same as https://bugs.freedesktop.org/show_bug.cgi?id=71208#c5" is definitely wrong. The stack-traces in that bug are on the devices created by DRM, unlike this trace which is in the i2c layer.

Can you please bisect again _starting_ at least after this:

  commit a3483353ca4e6dbeef2ed62ebed01af109b5b27a
  Author: David Herrmann <dh.herrmann@gmail.com>
  Date:   Wed Nov 13 11:42:26 2013 +0100

      drm: check for !kdev in drm_unplug_minor()

And please verify the stack-traces contain "i2cdev_detach_adapter".
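One way to run that check on a saved dmesg is sketched below. It is only an illustration: the helper names `warn_blocks` and `blocks_with_symbol` are hypothetical, the sample text is abridged from the traces in this report, and it assumes each warning is delimited by the usual "cut here" and "end trace" markers.

```python
import re

# Markers the kernel prints around a WARN splat.
CUT_HERE = re.compile(r"-+\[ cut here \]-+")
END_TRACE = re.compile(r"---\[ end trace [0-9a-f]+ \]---")


def warn_blocks(dmesg_text):
    """Yield each WARN block as a list of lines, from 'cut here' to 'end trace'."""
    block = None
    for line in dmesg_text.splitlines():
        if CUT_HERE.search(line):
            block = [line]          # start a new block, dropping any unfinished one
        elif block is not None:
            block.append(line)
            if END_TRACE.search(line):
                yield block
                block = None


def blocks_with_symbol(dmesg_text, symbol):
    """Return only the WARN blocks whose backtrace mentions `symbol`."""
    return [b for b in warn_blocks(dmesg_text) if any(symbol in l for l in b)]


# Abridged sample: one i2c-layer warning and one unrelated DRM warning.
SAMPLE = """\
[  200.944162] ------------[ cut here ]------------
[  200.944258] WARNING: CPU: 7 PID: 4296 at fs/sysfs/group.c:216 device_del+0x39/0x16a()
[  200.947485]  [<ffffffff816212d6>] ? i2cdev_detach_adapter+0x3e/0x42
[  200.949536] ---[ end trace 32f16d9b1d6381ab ]---
[    1.357371] ------------[ cut here ]------------
[    1.357376] WARNING: CPU: 0 PID: 1230 at drivers/gpu/drm/drm_modes.c:119 drm_mode_probed_add+0x27/0x41 [drm]()
[    1.357396]  [<ffffffffa0029754>] ? drm_mode_probed_add+0x27/0x41 [drm]
[    1.357603] ---[ end trace e75cbd96bfbd4fea ]---
"""

hits = blocks_with_symbol(SAMPLE, "i2cdev_detach_adapter")
print(len(hits), "of", len(list(warn_blocks(SAMPLE))), "WARN blocks mention the symbol")
```

A kernel would only count as "bad" for this bug if at least one block survives the filter; other warnings would be tracked separately.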
Comment 6 Daniel Vetter 2014-03-27 13:06:59 UTC
Setting NEEDINFO, otherwise our QA won't take action.
Comment 7 lu hua 2014-03-28 02:49:15 UTC
Bisect it again.
db31af1d4e815e141295b0bdf8da3e77885001d5 is the first bad commit
commit db31af1d4e815e141295b0bdf8da3e77885001d5
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Fri Nov 8 16:48:53 2013 +0200

    drm/i915: clean up backlight conditional build

    I've always felt the backlight device conditional build has been all
    backwards. Make it feel right.

    Gently move things towards connector based stuff while at it.

    There should be no functional changes.

    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Reviewed-by: Imre Deak <imre.deak@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 8 Imre Deak 2014-03-28 10:22:42 UTC
(In reply to comment #7)
> Bisect it again.
> db31af1d4e815e141295b0bdf8da3e77885001d5 is the first bad commit
> commit db31af1d4e815e141295b0bdf8da3e77885001d5
> Author: Jani Nikula <jani.nikula@intel.com>
> Date:   Fri Nov 8 16:48:53 2013 +0200
> 
>     drm/i915: clean up backlight conditional build
>

There was a related fix after this commit, so I think it's a more recent regression. Could you do - yet another - bisect between

commit 931c1c26983b4f84e33b78579fc8d57e4a14c6b4
Author: Imre Deak <imre.deak@intel.com>
Date:   Tue Feb 11 17:12:51 2014 +0200

    drm/i915: sdvo: add i2c sysfs symlink to the connector's directory

and current -nightly?
Comment 9 lu hua 2014-03-31 06:32:11 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Bisect it again.
> > db31af1d4e815e141295b0bdf8da3e77885001d5 is the first bad commit
> > commit db31af1d4e815e141295b0bdf8da3e77885001d5
> > Author: Jani Nikula <jani.nikula@intel.com>
> > Date:   Fri Nov 8 16:48:53 2013 +0200
> > 
> >     drm/i915: clean up backlight conditional build
> >
> 
> There was a related fix after this commit, so I think it's a more recent
> regression. Could you do - yet another - bisect between
> 
> commit 931c1c26983b4f84e33b78579fc8d57e4a14c6b4
> Author: Imre Deak <imre.deak@intel.com>
> Date:   Tue Feb 11 17:12:51 2014 +0200
> 
>     drm/i915: sdvo: add i2c sysfs symlink to the connector's directory
> 
> and current -nightly?

I selected 931c1c26983b4f84e33b78579fc8d57e4a14c6b4 as the good commit. Many commits fail with "./drv_module_reload: line 43: /sys/class/vtconsole/vtcon1/bind: No such file or directory", so I skipped those commits.
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2eb4c7b1e7f275fe833aabe0a251b8e3f767fb08
3ae471f73a1d581e078b5b06d08d7b82833a093f
fc275a74eb816c12d4fc226344e734872ed0b2f9
2e9a3fc3a360ac180f5b4c3c4416a0d0dec60dd8
6ae668cc19e8b18df28cd67b3448d9abd79284a4
71c68c4fc9bdcd6e46107a0f40b50a523f3b4fe0
7288ca07b638db485abec5752bd6b1faed1c33ef
9e541466eed411cb5462fa9e6181c4d409e7e2ef
Comment 10 Daniel Vetter 2014-04-11 16:29:11 UTC
(In reply to comment #9)
> There are only 'skip'ped commits left to test.
> The first bad commit could be any of:
> 2eb4c7b1e7f275fe833aabe0a251b8e3f767fb08
> 3ae471f73a1d581e078b5b06d08d7b82833a093f
> fc275a74eb816c12d4fc226344e734872ed0b2f9
> 2e9a3fc3a360ac180f5b4c3c4416a0d0dec60dd8
> 6ae668cc19e8b18df28cd67b3448d9abd79284a4
> 71c68c4fc9bdcd6e46107a0f40b50a523f3b4fe0
> 7288ca07b638db485abec5752bd6b1faed1c33ef
> 9e541466eed411cb5462fa9e6181c4d409e7e2ef

All these commits are for drm/i2c/tda998x which we don't use at all for our driver. I suspect something has gone wrong with the bisect, can you please double-check?
Comment 11 lu hua 2014-04-17 07:30:46 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > There are only 'skip'ped commits left to test.
> > The first bad commit could be any of:
> > 2eb4c7b1e7f275fe833aabe0a251b8e3f767fb08
> > 3ae471f73a1d581e078b5b06d08d7b82833a093f
> > fc275a74eb816c12d4fc226344e734872ed0b2f9
> > 2e9a3fc3a360ac180f5b4c3c4416a0d0dec60dd8
> > 6ae668cc19e8b18df28cd67b3448d9abd79284a4
> > 71c68c4fc9bdcd6e46107a0f40b50a523f3b4fe0
> > 7288ca07b638db485abec5752bd6b1faed1c33ef
> > 9e541466eed411cb5462fa9e6181c4d409e7e2ef
> 
> All these commits are for drm/i2c/tda998x which we don't use at all for our
> driver. I suspect something has gone wrong with the bisect, can you please
> double-check?

I will bisect it again.
good commit: b2040f6fed736ccd2319768bc59833abe74148b8
bad commit: 33688d95c458ffca6b247189cc6f15277fd6abf0
Comment 12 lu hua 2014-04-17 07:51:44 UTC
Bisect shows: 1c61eae469e0d1d2fb9d7b77f51ca50c1f8f3ce9 is the first bad commit
commit 1c61eae469e0d1d2fb9d7b77f51ca50c1f8f3ce9
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Feb 18 01:50:22 2014 -0700

    drm/radeon: fix CP semaphores on CIK

    The CP semaphore queue on CIK has a bug that triggers if uncompleted
    waits use the same address while a signal is still pending. Work around
    this by using different addresses for each sync.

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Cc: stable@vger.kernel.org
Comment 13 Jani Nikula 2014-04-17 16:14:45 UTC
(In reply to comment #12)
> Bisect shows: 1c61eae469e0d1d2fb9d7b77f51ca50c1f8f3ce9 is the first bad
> commit
> commit 1c61eae469e0d1d2fb9d7b77f51ca50c1f8f3ce9
> Author: Christian König <christian.koenig@amd.com>
> Date:   Tue Feb 18 01:50:22 2014 -0700
> 
>     drm/radeon: fix CP semaphores on CIK

I don't think there's any way this could be the culprit. Due to multiple different bisect results, I suspect the issue you're seeing occurs sometimes, but not always, so you can't rely on one good test result only for bisection.
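A minimal sketch of a per-commit decision that tolerates an intermittent failure follows. It relies on the exit codes `git bisect run` interprets (0 for good, 125 for skip, other codes from 1 to 127 for bad). Everything else is an assumption made for illustration: `run_test` is a hypothetical callable returning the test's return code and the dmesg captured for that run, the default of five runs is arbitrary, and treating a non-zero return code without the warning as "cannot test this commit" is a guess about how the test script reports setup errors.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes that `git bisect run` interprets


def classify_run(returncode, dmesg_text, needle="i2cdev_detach_adapter"):
    """Classify one test run as 'warn', 'broken' or 'pass'."""
    if needle in dmesg_text:
        return "warn"      # the warning under investigation showed up
    if returncode != 0:
        return "broken"    # assumption: the test could not run on this commit
    return "pass"


def bisect_verdict(run_test, runs=5):
    """Repeat the test; one warning marks the commit bad, however many runs passed."""
    saw_broken = False
    for _ in range(runs):
        outcome = classify_run(*run_test())
        if outcome == "warn":
            return BAD
        if outcome == "broken":
            saw_broken = True
    # Only call the commit good if every run was clean.
    return SKIP if saw_broken else GOOD


# Stand-in for a flaky commit: the warning appears only on the third run.
fake_logs = iter(["", "", "... i2cdev_detach_adapter+0x3e/0x42", "", ""])
print(bisect_verdict(lambda: (0, next(fake_logs))))
```

With a single run per commit, the stand-in above would have been marked good, which is how an intermittent failure can steer a bisect toward an unrelated commit.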
Comment 14 lu hua 2014-04-18 07:15:47 UTC
It passes 5 in 5 runs on commit:b2040f6fed7
It fails 5 in 5 runs on commit:33688d95c45
Comment 15 Jani Nikula 2014-04-22 11:41:20 UTC
(In reply to comment #14)
> It passes 5 in 5 runs on commit:b2040f6fed7
> It fails 5 in 5 runs on commit:33688d95c45

$ git log --oneline b2040f6fed7..33688d95c45 | wc -l
339

Please bisect into these two.
Comment 16 lu hua 2014-04-25 06:42:03 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > It passes 5 in 5 runs on commit:b2040f6fed7
> > It fails 5 in 5 runs on commit:33688d95c45
> 
> $ git log --oneline b2040f6fed7..33688d95c45 | wc -l
> 339
> 
> Please bisect into these two.

Comment 12's bisect result is between these 2 commits
Comment 17 Jani Nikula 2014-04-25 09:02:36 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > It passes 5 in 5 runs on commit:b2040f6fed7
> > > It fails 5 in 5 runs on commit:33688d95c45
> > 
> > $ git log --oneline b2040f6fed7..33688d95c45 | wc -l
> > 339
> > 
> > Please bisect into these two.
> 
> Comment 12's bisect result is between these 2 commits

Maybe, but it's a change in Radeon code, not our code. I don't believe the result is correct.
Comment 18 lu hua 2014-04-29 07:59:56 UTC
(In reply to comment #14)
> It passes 5 in 5 runs on commit:b2040f6fed7
> It fails 5 in 5 runs on commit:33688d95c45

Retested on commit b2040f6fed7; it also causes the call trace.
Comment 19 lu hua 2014-04-30 08:00:24 UTC
Re-bisected it: 24b9bf43e93e0edd89072da51cf1fab95fc69dec is the first bad commit
commit 24b9bf43e93e0edd89072da51cf1fab95fc69dec
Author: Nikolay Aleksandrov <nikolay@redhat.com>
Date:   Mon Mar 3 23:19:18 2014 +0100

    net: fix for a race condition in the inet frag code

    I stumbled upon this very serious bug while hunting for another one,
    it's a very subtle race condition between inet_frag_evictor,
    inet_frag_intern and the IPv4/6 frag_queue and expire functions
    (basically the users of inet_frag_kill/inet_frag_put).

    What happens is that after a fragment has been added to the hash chain
    but before it's been added to the lru_list (inet_frag_lru_add) in
    inet_frag_intern, it may get deleted (either by an expired timer if
    the system load is high or the timer sufficiently low, or by the
    fraq_queue function for different reasons) before it's added to the
    lru_list, then after it gets added it's a matter of time for the
    evictor to get to a piece of memory which has been freed leading to a
    number of different bugs depending on what's left there.

    I've been able to trigger this on both IPv4 and IPv6 (which is normal
    as the frag code is the same), but it's been much more difficult to
    trigger on IPv4 due to the protocol differences about how fragments
    are treated.

After reverting this commit, a new warning and call trace appear:
[    1.357371] ------------[ cut here ]------------
[    1.357376] WARNING: CPU: 0 PID: 1230 at drivers/gpu/drm/drm_modes.c:119 drm_mode_probed_add+0x27/0x41 [drm]()
[    1.357376] Modules linked in: firewire_ohci(+) firewire_core crc_itu_t i915(+) video drm_kms_helper drm floppy button
[    1.357381] CPU: 0 PID: 1230 Comm: systemd-udevd Tainted: G        W    3.14.0-rc7_queued_revert_24b9bf43e_20140429+ #1
[    1.357382] Hardware name: Gigabyte Technology Co., Ltd. H55M-UD2H/H55M-UD2H, BIOS F4 12/02/2009
[    1.357383]  0000000000000000 0000000000000009 ffffffff81716de3 0000000000000000
[    1.357385]  ffffffff81035052 ffff88003734f000 ffffffffa0029754 0000000000004ba5
[    1.357386]  ffff880111359300 ffff8800d368ec00 ffff8800d35a1500 0000000000004ba5
[    1.357388] Call Trace:
[    1.357390]  [<ffffffff81716de3>] ? dump_stack+0x41/0x51
[    1.357392]  [<ffffffff81035052>] ? warn_slowpath_common+0x73/0x8b
[    1.357396]  [<ffffffffa0029754>] ? drm_mode_probed_add+0x27/0x41 [drm]
[    1.357400]  [<ffffffffa0029754>] ? drm_mode_probed_add+0x27/0x41 [drm]
[    1.357403]  [<ffffffffa002c45f>] ? drm_add_edid_modes+0x2d6/0xd02 [drm]
[    1.357408]  [<ffffffffa002575f>] ? drm_mode_object_get+0x51/0x60 [drm]
[    1.357423]  [<ffffffffa00b7790>] ? intel_connector_update_modes+0x1c/0x36 [i915]
[    1.357425]  [<ffffffff8171b390>] ? mutex_lock+0x9/0x25
[    1.357441]  [<ffffffffa00c0b1c>] ? intel_crt_ddc_get_modes+0x21/0x3c [i915]
[    1.357458]  [<ffffffffa00c0b7d>] ? intel_crt_get_modes+0x46/0x8a [i915]
[    1.357471]  [<ffffffffa005f5b3>] ? drm_helper_probe_single_connector_modes+0x138/0x2d2 [drm_kms_helper]
[    1.357475]  [<ffffffffa0060318>] ? drm_fb_helper_probe_connector_modes+0x38/0x4c [drm_kms_helper]
[    1.357477]  [<ffffffffa006127c>] ? drm_fb_helper_initial_config+0x1ab/0x450 [drm_kms_helper]
[    1.357480]  [<ffffffff810d9b6d>] ? kmem_cache_alloc+0x23/0xac
[    1.357497]  [<ffffffffa00a1374>] ? gen5_write32+0x21/0x47 [i915]
[    1.357514]  [<ffffffffa0096bb7>] ? ibx_display_interrupt_update+0x91/0xb4 [i915]
[    1.357531]  [<ffffffffa00a1374>] ? gen5_write32+0x21/0x47 [i915]
[    1.357553]  [<ffffffffa00d8865>] ? i915_driver_load+0xbad/0xe1e [i915]
[    1.357560]  [<ffffffffa002049f>] ? drm_dev_register+0x74/0xe7 [drm]
[    1.357565]  [<ffffffffa0022729>] ? drm_get_pci_dev+0xff/0x1bc [drm]
[    1.357567]  [<ffffffff81384e55>] ? __pm_runtime_resume+0x5b/0x6a
[    1.357569]  [<ffffffff812f8bc9>] ? local_pci_probe+0x35/0x7a
[    1.357572]  [<ffffffff8137c904>] ? driver_probe_device+0x1b3/0x1b3
[    1.357574]  [<ffffffff812f8e6c>] ? pci_device_probe+0xcc/0xf0
[    1.357576]  [<ffffffff8137c7e3>] ? driver_probe_device+0x92/0x1b3
[    1.357578]  [<ffffffff8137c957>] ? __driver_attach+0x53/0x73
[    1.357580]  [<ffffffff8137b0be>] ? bus_for_each_dev+0x4e/0x7f
[    1.357582]  [<ffffffff8137c065>] ? bus_add_driver+0xe2/0x1c7
[    1.357585]  [<ffffffff8137ce9a>] ? driver_register+0x82/0xb5
[    1.357587]  [<ffffffffa010e000>] ? 0xffffffffa010dfff
[    1.357589]  [<ffffffff81000296>] ? do_one_initcall+0x78/0xfa
[    1.357591]  [<ffffffff8104e4af>] ? __blocking_notifier_call_chain+0x4f/0x5d
[    1.357594]  [<ffffffff8107fe72>] ? load_module+0x1745/0x1a13
[    1.357596]  [<ffffffff8107da98>] ? mod_kobject_put+0x42/0x42
[    1.357599]  [<ffffffff81080229>] ? SyS_finit_module+0x4e/0x62
[    1.357602]  [<ffffffff817214a2>] ? system_call_fastpath+0x16/0x1b
[    1.357603] ---[ end trace e75cbd96bfbd4fea ]---
[    1.357605] ------------[ cut here ]------------
Comment 20 Chris Wilson 2014-04-30 10:22:56 UTC
This bug has confused several different WARNs. Let's start afresh.
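As an illustration of telling the warnings apart, the sketch below keys each WARN on its source location and function rather than on the test that triggered it. The helper name `warn_signatures` is hypothetical, and the regular expression assumes the "WARNING: CPU: N PID: N at file:line func+offset" header format seen in the two traces in this report.

```python
import re
from collections import Counter

# Captures source file, line number and function name from a WARN header line.
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) (\w+)\+")


def warn_signatures(dmesg_text):
    """Count WARNs by (file, line, function) so distinct warnings stay separate."""
    return Counter(
        (m.group(1), int(m.group(2)), m.group(3))
        for m in WARN_RE.finditer(dmesg_text)
    )


# The two WARN header lines that appear in this report.
HEADERS = """\
[  200.944258] WARNING: CPU: 7 PID: 4296 at fs/sysfs/group.c:216 device_del+0x39/0x16a()
[    1.357376] WARNING: CPU: 0 PID: 1230 at drivers/gpu/drm/drm_modes.c:119 drm_mode_probed_add+0x27/0x41 [drm]()
"""

for signature, count in warn_signatures(HEADERS).items():
    print(count, signature)
```

Two different signatures suggest two different problems, each of which would need its own report and its own bisect.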
Comment 21 Elizabeth 2017-10-06 14:39:04 UTC
Closing old verified.

