Bug 103731

Summary: [DRM/i915][Bisected] Kernel panic in setup_vector_irq() with HDMI LPE driver when bringing an offline CPU thread back online
Product: DRI
Reporter: Augustine Chen <augustine.chen>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WONTFIX
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: highest
CC: intel-gfx-bugs, matthew.d.roper, ray.hsu, tiwai
Version: XOrg git
Keywords: bisected
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: BSW/CHT
i915 features: display/audio
Attachments:
 -- Description: Full dmesg
 -- Flags: none

Description Augustine Chen 2017-11-14 08:01:08 UTC
Created attachment 135445
Full dmesg

System Environment:
--------------------------
 -- Platform: CHT
 -- System architecture: x86_64
 -- Kernel version: 4.14.0-041400-generic
 -- Linux distribution: Ubuntu 16.04 LTS
 -- Motherboard model: CHT T3 RVP with CHT-T3 D1 SoC (reproducible on both T3 and T4 RVPs)
 -- Display connector: HDMI


Bug detailed description:
-----------------------------
During a CPU thread online/offline test, the system hangs when an offline thread is brought back online. The issue has been reproducible since kernel 4.11-rc1 and is still present in kernel 4.14. It was bisected to commit eef57324d926f0d8c7a40069e7d26e0cb0651b47. After further digging, a kernel panic in setup_vector_irq() was captured, with the debug messages below.

 ------------[ cut here ]------------
[   87.353072] irq 298 idata->chip->name hdmi_lpe_audio_irqchip
[   87.353072] irq 298 apic_chip_data
[   87.353073] irq 298 data->domain is NULL
[   87.353120] BUG: unable to handle kernel NULL pointer dereference at (null)
[   87.353132] IP: setup_vector_irq+0x1ba/0x230
[   87.353133] PGD 0


Reproduce steps:
----------------------------
1. Check cpuinfo: $ grep "processor" /proc/cpuinfo
2. Disable cpu[1-3]: $ echo 0 > /sys/devices/system/cpu/cpu[1-3]/online
3. Enable cpu[1-3]: $ echo 1 > /sys/devices/system/cpu/cpu[1-3]/online <-- the issue happens here; the system hangs.
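
The cpu[1-3] paths in steps 2 and 3 read as shorthand for cpu1 through cpu3; a shell redirection needs a single target file, so each CPU has to be written individually. A minimal sketch of one offline/online cycle follows, assuming a POSIX shell and root privileges. The function names and the CPU_SYSFS_ROOT override are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical override so the helpers can be pointed at a scratch
# directory; on a real system the default sysfs path is used.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# set_cpus_online STATE CPU...
# Write STATE (0 = offline, 1 = online) to each listed CPU's "online" file.
set_cpus_online() {
    state="$1"
    shift
    for cpu in "$@"; do
        echo "$state" > "$CPU_SYSFS_ROOT/cpu$cpu/online" || return 1
    done
}

# cycle_cpus CPU...
# Take the listed CPUs offline, then bring them back online (steps 2 and 3).
cycle_cpus() {
    set_cpus_online 0 "$@" && set_cpus_online 1 "$@"
}
```

Run as root, `cycle_cpus 1 2 3` performs steps 2 and 3 in sequence. Per this report, an affected kernel is expected to hang during the online step.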


First bad commit:
----------------------------
commit eef57324d926f0d8c7a40069e7d26e0cb0651b47
Comment 1 Chris Wilson 2017-11-14 11:02:53 UTC
For reference, the bisect result is

commit eef57324d926f0d8c7a40069e7d26e0cb0651b47
Author: Jerome Anand <jerome.anand@intel.com>
Date:   Wed Jan 25 04:27:49 2017 +0530

    drm/i915: setup bridge for HDMI LPE audio driver
    
    Enable support for HDMI LPE audio mode on Baytrail and
    Cherrytrail when HDaudio controller is not detected
    
    Setup minimum required resources during i915_driver_load:
    1. Create a platform device to share MMIO/IRQ resources
    2. Make the platform device child of i915 device for runtime PM.
    3. Create IRQ chip to forward HDMI LPE audio irqs.
    
    HDMI LPE audio driver (a standalone sound driver) probes the
    LPE audio device and creates a new sound card.
    
    Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
    Signed-off-by: Jerome Anand <jerome.anand@intel.com>
    Acked-by: Jani Nikula <jani.nikula@intel.com>
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
Comment 2 Pierre Bossart 2017-11-20 21:39:23 UTC
I can reproduce the system hang on a Zotac PI330 device (CHT).

Can you clarify which debug options you used? I just see a system hang when one core is brought online again, and I can't get a dmesg as detailed as yours.

FWIW, when using a regular 4.12 Fedora install I also see the problem, but I get an additional message when taking core 1 offline:

[  163.799497] Cannot set affinity for irq 158
[  163.801408] smpboot: CPU 1 is now offline
Comment 3 Takashi Iwai 2017-11-27 16:36:30 UTC
It smells more like a bug on the x86 CPU hotplug side to me.
Can anyone check whether it's reproducible with 4.15-rc1?  There have been quite a lot of fixes / cleanups in the x86 code in this regard.
Comment 4 Augustine Chen 2017-11-29 08:41:22 UTC
Two findings here:
The first is that the HDMI LPE audio IRQ has no chip_data that the x86 APIC driver recognizes, so the APIC driver tries to handle an invalid pointer, which causes the kernel panic.
The second is that disabling CONFIG_CPUMASK_OFFSTACK makes cpumask_var_t an array type. With that change, the invalid-pointer access in the x86 APIC driver no longer happens.
Comment 5 Augustine Chen 2017-12-18 02:53:47 UTC
This issue cannot be reproduced in v4.15-rc1 because the code in setup_vector_irq() that made the improper reference to an invalid pointer has been patched. After discussion, there is no need to modify this driver now.
