Bug 94736

Summary: [BAT IGT ALL] drv_module_reload_basic fail on all platforms
Product: DRI Reporter: Daniela Prodan <daniela.doras-prodan>
Component: DRM/IntelAssignee: Libin Yang <libin.yang>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: medium CC: intel-gfx-bugs, marius.c.vlad, tiwai
Version: DRI git   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: ALL i915 features:
Attachments:
Description Flags
dmesg log
none
drm-intel-nightly commit
none
Fix patch
none
Debug patch
none
Debug messages for patch
none
Fix patch #2 none

Description Daniela Prodan 2016-03-29 09:59:57 UTC
Created attachment 122607 [details]
dmesg log

Due to sound patches, the drv_module_reload_basic is failing on all platforms:

drm-intel-nightly: 2016y-03m-29d-08h-51m-45s UTC integration manifest


[  112.715554] ------------[ cut here ]------------
[  112.715561] WARNING: CPU: 3 PID: 5306 at sound/hda/hdac_i915.c:309 snd_hdac_i915_init+0x12a/0x170 [snd_hda_core]
[  112.715562] Modules linked in: snd_hda_intel(+) i915 snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_codec_realtek snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei i2c_hid e1000e ptp pps_core sdhci_pci sdhci mmc_core [last unloaded: i915]
[  112.715577] CPU: 3 PID: 5306 Comm: kworker/3:18 Tainted: G     U          4.6.0-rc1-gfxbench+ #1
[  112.715578] Hardware name:                  /NUC6i5SYB, BIOS SYSKLi35.86A.0036.2016.0223.1140 02/23/2016
[  112.715582] Workqueue: events azx_probe_work [snd_hda_intel]
[  112.715583]  0000000000000000 ffff880261047ca8 ffffffff81404d25 0000000000000000
[  112.715586]  0000000000000000 ffff880261047ce8 ffffffff81079b7c 0000013500000000
[  112.715588]  ffff880036c633f8 00000000fffffff0 ffff880036c633f8 ffff88026441b490
[  112.715590] Call Trace:
[  112.715594]  [<ffffffff81404d25>] dump_stack+0x67/0x92
[  112.715596]  [<ffffffff81079b7c>] __warn+0xcc/0xf0
[  112.715598]  [<ffffffff81079c58>] warn_slowpath_null+0x18/0x20
[  112.715601]  [<ffffffffa00ef67a>] snd_hdac_i915_init+0x12a/0x170 [snd_hda_core]
[  112.715604]  [<ffffffffa001f20c>] azx_probe_work+0x6c/0x870 [snd_hda_intel]
[  112.715606]  [<ffffffff810cbc99>] ? __lock_is_held+0x49/0x70
[  112.715609]  [<ffffffff810962fb>] process_one_work+0x1cb/0x680
[  112.715610]  [<ffffffff81096276>] ? process_one_work+0x146/0x680
[  112.715612]  [<ffffffff810967f9>] worker_thread+0x49/0x490
[  112.715614]  [<ffffffff810967b0>] ? process_one_work+0x680/0x680
[  112.715615]  [<ffffffff8109ceda>] kthread+0xea/0x100
[  112.715618]  [<ffffffff817d0c87>] ? _raw_spin_unlock_irq+0x27/0x50
[  112.715620]  [<ffffffff817d1672>] ret_from_fork+0x22/0x40
[  112.715621]  [<ffffffff8109cdf0>] ? kthread_create_on_node+0x1f0/0x1f0
[  112.715622] ---[ end trace d0e1df257615cbc1 ]---

I have attached dmesg and commits.log
Comment 1 Daniela Prodan 2016-03-29 10:01:11 UTC
Created attachment 122608 [details]
drm-intel-nightly commit
Comment 2 Takashi Iwai 2016-03-29 10:33:48 UTC
Thanks, it's just a spurious WARN_ON() for a leftover pointer at module exit, and nothing serious.  The patch below fixes it.
Comment 3 Takashi Iwai 2016-03-29 10:34:03 UTC
Created attachment 122609 [details] [review]
Fix patch
Comment 4 Marius Vlad 2016-03-29 13:19:02 UTC
Same warning:


[   40.798598] ------------[ cut here ]------------
[   40.798606] WARNING: CPU: 1 PID: 450 at sound/hda/hdac_i915.c:309 snd_hdac_i915_init+0x130/0x169 [snd_hda_core]
[   40.798608] Modules linked in: snd_hda_intel(+) i915 drm_kms_helper drm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_codec intel_rapl snd_hda_core x86_pkg_temp_thermal dell_smm_hwmon intel_powerclamp tpm_tis tpm joydev coretemp snd_hwdep ghash_clmulni_intel serio_raw snd_pcm snd_timer lpc_ich mei_me mei processor battery evdev netconsole hid_generic usbhid hid xhci_pci xhci_hcd psmouse ehci_pci ehci_hcd r8169 i2c_algo_bit mii usbcore usb_common fan thermal video button [last unloaded: drm]
[   40.798624] CPU: 1 PID: 450 Comm: kworker/1:3 Tainted: G     U  W       4.6.0-rc1+ #89
[   40.798625] Hardware name: Dell Inc. Inspiron 3847/088DT1       , BIOS A06 01/15/2015
[   40.798628] Workqueue: events azx_probe_work [snd_hda_intel]
[   40.798629]  0000000000000286 00000000ffe976ea ffffffff814b8ca5 0000000000000000
[   40.798630]  0000000000000000 ffffffff810f8ebe ffff88002f29f708 ffff8802058ae600
[   40.798632]  ffff88021fa96240 ffff880215968098 ffff880215968000 ffff88002f29f000
[   40.798633] Call Trace:
[   40.798637]  [<ffffffff814b8ca5>] ? dump_stack+0x5c/0x77
[   40.798640]  [<ffffffff810f8ebe>] ? __warn+0xbe/0xe0
[   40.798644]  [<ffffffffa03d49e0>] ? snd_hdac_i915_init+0x130/0x169 [snd_hda_core]
[   40.798646]  [<ffffffff8112d127>] ? dequeue_task_fair+0x57/0x940
[   40.798649]  [<ffffffffa0116642>] ? azx_probe_continue+0x2e2/0x850 [snd_hda_intel]
[   40.798652]  [<ffffffff8111170a>] ? process_one_work+0x15a/0x410
[   40.798653]  [<ffffffff81111a0d>] ? worker_thread+0x4d/0x480
[   40.798654]  [<ffffffff811119c0>] ? process_one_work+0x410/0x410
[   40.798656]  [<ffffffff8111783d>] ? kthread+0xcd/0xf0
[   40.798659]  [<ffffffff81791ca2>] ? ret_from_fork+0x22/0x40
[   40.798661]  [<ffffffff81117770>] ? kthread_create_on_node+0x190/0x190
[   40.798662] ---[ end trace 368a576a7cb20cec ]---


Tried with:

commit f6216961932e1c64251ddb6a6ba17e4f92596397
Author: Takashi Iwai <tiwai@suse.de>
Date:   Tue Mar 29 12:29:24 2016 +0200

    ALSA: hda - Clear the leftover component assignment at snd_hdac_i915_exit()
    
    The commit [d745f5e7b8b2: ALSA: hda - Add the pin / port mapping on
    Intel ILK and VLV] introduced a WARN_ON() to check the pointer for
    avoiding the double initializations.  But hdac_acomp pointer wasn't
    cleared at snd_hdac_i915_exit(), thus after reloading the HD-audio
    driver, it may result in the false positive warning.  This patch makes
    sure to clear the leftover pointer at exit.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94736
    Reported-by: Daniela Doras-prodan <daniela.doras-prodan@intel.com>
    Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit 2702045fbd4188c1c26bd890bc43976fd10937ad
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Tue Mar 29 15:21:58 2016 +0300

    drm-intel-nightly: 2016y-03m-29d-12h-21m-19s UTC integration manifest
Comment 5 Takashi Iwai 2016-03-29 13:25:20 UTC
(In reply to Marius Vlad from comment #4)
> Same warning:

Which hardware?  The behavior pretty much depends on the model.
Comment 6 Marius Vlad 2016-03-29 13:56:15 UTC
HSW, i5-4460, I'll try on other platforms as well.
Comment 7 Marius Vlad 2016-03-29 13:58:17 UTC
(In reply to Marius Vlad from comment #6)
> HSW, i5-4460, I'll try on other platforms as well.

The initial bug report is on all platforms from ILK to SKL.
(ILK, SNB, IVB, BYT, HSW, BDW, SLK).
Comment 8 Marius Vlad 2016-03-29 14:04:04 UTC
Was curios and indeed d745f5e7b8b2961f68b0b9093a0f914a8a83c2ae seems to trigger those warns.
Comment 9 Takashi Iwai 2016-03-29 14:08:44 UTC
Yes, the commit added WARN_ON() at probe, but forgot to clear it at exit.
And today's sound.git tree got the support of i915 binding for ILK and later models.  That's why you hit on all ILK+ machines (but some machines like ILK or IVY should show a different stack trace).

But I wonder why you still get the warning with the latest fix.  It works on my several machines.

Could you give the simplest reproducer to trigger WARN_ON() even with the fix patch?
Comment 10 Marius Vlad 2016-03-29 14:30:30 UTC
From i-g-t:

tests/drv_module_reload, or the following script:

        # we must kick away fbcon (but only fbcon)
        for vtcon in /sys/class/vtconsole/vtcon*/ ; do
                if grep "frame buffer device" $vtcon/name > /dev/null ; then
                        echo unbinding $vtcon: `cat $vtcon/name`
                        echo 0 > $vtcon/bind
                fi
        done

        # The sound driver uses our power well
        pkill alsactl
        rmmod snd_hda_intel &> /dev/null

        #ignore errors in ips - gen5 only
        rmmod intel_ips &> /dev/null
        rmmod i915 || return $IGT_EXIT_SKIP
        #ignore errors in intel-gtt, often built-in
        rmmod intel-gtt &> /dev/null
        # drm may be used by other devices (nouveau, radeon, udl, etc)
        rmmod drm_kms_helper &> /dev/null
        rmmod drm &> /dev/null

        if lsmod | grep i915 &> /dev/null ; then
                echo WARNING: i915.ko still loaded!
                return $IGT_EXIT_FAILURE
        else
                echo module successfully unloaded
        fi

        modprobe i915 $*

        if [ -f /sys/class/vtconsole/vtcon1/bind ]; then
                echo 1 > /sys/class/vtconsole/vtcon1/bind
        fi

        modprobe snd_hda_intel
Comment 11 Takashi Iwai 2016-03-29 14:59:29 UTC
The test works fine on my HSW machine.  Could you double-check whether it still happens with the patch in comment 3?
Comment 12 Marius Vlad 2016-03-29 15:35:42 UTC
I can reproduce here with http://paste.debian.net/422479/

It is a variant of tests/drv_module_reload_basic but w/o i-g-t dependencies. 

Might have more success with that.
Comment 13 Takashi Iwai 2016-03-29 15:45:04 UTC
(In reply to Marius Vlad from comment #12)
> I can reproduce here with http://paste.debian.net/422479/
> 
> It is a variant of tests/drv_module_reload_basic but w/o i-g-t dependencies. 

Hmm, the script still works here on my HSW machine.

Is your kernel really with the patch applied?  If so, please apply the debug patch below in addition, and give the whole kernel message after the reload test.
Comment 14 Takashi Iwai 2016-03-29 15:45:31 UTC
Created attachment 122612 [details] [review]
Debug patch
Comment 15 Marius Vlad 2016-03-29 16:10:23 UTC
Created attachment 122613 [details]
Debug messages for patch

Here's the output you requested. Unbinding doesn't happen at subsequent runs (more than once seems here to be the problem).
Comment 16 Takashi Iwai 2016-03-29 16:54:44 UTC
Thanks, now it's clear that it's the error path that didn't clear the pointer.  The fix is below.

At the next time, please give the whole kernel message from the beginning.  Sometimes it leads to a misjudge without the proper information like this case.
Comment 17 Takashi Iwai 2016-03-29 16:55:26 UTC
Created attachment 122618 [details] [review]
Fix patch #2
Comment 18 Takashi Iwai 2016-03-29 16:57:24 UTC
BTW, the secondary error path was triggered basically by a kmod bug.  That's why I couldn't reproduce it here.  It's already fixed in the kmod upstream, as it seems.
Comment 19 Marius Vlad 2016-03-29 17:08:09 UTC
My bad won't happen again. 

Tested with both patches (_init and _exit) and I see no warnings now:

commit 50bd9fb15ccf6f013131f334cd604d783c0c3186
Author: Takashi Iwai <tiwai@suse.de>
Date:   Tue Mar 29 18:48:07 2016 +0200

    ALSA: hda - Fix yet another i915 pointer leftover in error path
    
    The hdac_acomp object in hdac_i915.c is left as assigned even after
    binding with i915 actually fails, and this leads to the WARN_ON() at
    the next load of the module.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94736
    Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit f07183a706c41c8aabef9cd711e45a4ca367aeea
Author: Takashi Iwai <tiwai@suse.de>
Date:   Tue Mar 29 12:29:24 2016 +0200

    ALSA: hda - Clear the leftover component assignment at snd_hdac_i915_exit()
    
    The commit [d745f5e7b8b2: ALSA: hda - Add the pin / port mapping on
    Intel ILK and VLV] introduced a WARN_ON() to check the pointer for
    avoiding the double initializations.  But hdac_acomp pointer wasn't
    cleared at snd_hdac_i915_exit(), thus after reloading the HD-audio
    driver, it may result in the false positive warning.  This patch makes
    sure to clear the leftover pointer at exit.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94736
    Reported-by: Daniela Doras-prodan <daniela.doras-prodan@intel.com>
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
Comment 20 Marius Vlad 2016-03-31 13:19:10 UTC
This is fixed in nightly (03c0f854e93263563f559d2bc8e47fb51adae697).
Comment 21 Jari Tahvanainen 2016-09-30 09:35:25 UTC
Closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.