Bug 104980 - NULL pointer in drm_dp_mst_wait_tx_reply / hotplugging via DP MST hub causes oops
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-06 23:17 UTC by Adam Nielsen
Modified: 2018-05-27 00:35 UTC

See Also:
i915 platform: BDW
i915 features: display/DP MST


Attachments
drm.debug=14 dmesg (379.99 KB, text/plain)
2018-02-14 22:17 UTC, Adam Nielsen

Description Adam Nielsen 2018-02-06 23:17:43 UTC
Steps to reproduce:

1. Connect DisplayPort MST hub to Intel NUC
2. Connect DisplayPort monitors to MST hub
3. Activate displays
4. Remove displays (power cycling them is good enough, but removing and reconnecting the DisplayPort cable also seems to work)
5. When displays are powered on again/reconnected, there is no signal, but any non-MST-connected monitors are still usable
6. Power cycling the displays a second time causes a kernel oops
7. MST monitors still have no signal, non-MST monitors freeze (show a picture but no updates, mouse cursor doesn't move, etc.)
8. SSHing into the machine is still possible, but rebooting or shutting down never finishes; the machine must be power cycled.

This can be reproduced 100% of the time.  Note that power cycling here means switching the monitors off at the mains; using the monitors' soft-power buttons doesn't seem to cause a problem.

Upgraded to kernel 4.14.14 but still have the issue.  System is an Intel NUC5i3RYK.  Have only tested with Lenovo LT2452p monitors.

Please advise if you need any further info.  I am assuming that if you have access to a DisplayPort MST hub then you will be able to reproduce the issue pretty easily by experimenting with hotplugging an active DisplayPort monitor.

Looks like it's a failure querying the EDID info from the monitor?

Here is dmesg after the failure:

[  547.671668] BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
[  547.671682] IP: mutex_lock+0x10/0x20
[  547.671684] PGD 0 P4D 0
[  547.671689] Oops: 0002 [#1] PREEMPT SMP PTI
[  547.671692] Modules linked in: cmac md4 xt_nat xt_tcpudp veth xfs ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic loop dm_mod nls_utf8 cifs ccm dns_resolver fscache arc4 nct6775 hwmon_vid iwlmvm snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp mac80211 coretemp nls_iso8859_1 nls_cp437 vfat fat kvm_intel iwlwifi i915 kvm irqbypass iTCO_wdt iTCO_vendor_support i2c_algo_bit drm_kms_helper crct10dif_pclmul evdev crc32_pclmul ghash_clmulni_intel mac_hid pcbc snd_hda_codec_realtek drm cfg80211 snd_hda_codec_generic aesni_intel          
[  547.671754]  aes_x86_64 e1000e crypto_simd snd_hda_intel glue_helper cryptd intel_cstate intel_rapl_perf mei_me snd_soc_ssm4567 snd_hda_codec pcspkr snd_soc_rt5640 intel_gtt snd_soc_rl6231 agpgart i2c_i801 tpm_tis shpchp lpc_ich syscopyarea ptp ir_rc6_decoder mei tpm_tis_core sysfillrect snd_hda_core pps_core sysimgblt fb_sys_fops thermal fan tpm btusb snd_hwdep rc_rc6_mce snd_soc_core btrtl ir_lirc_codec lirc_dev btbcm btintel nuvoton_cir battery snd_compress rc_core snd_pcm_dmaengine snd_soc_sst_acpi snd_pcm video snd_soc_sst_match elan_i2c bluetooth snd_timer i2c_hid acpi_als 8250_dw snd kfifo_buf button soundcore industrialio ecdh_generic hid rfkill ac97_bus spi_pxa2xx_platform acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables ext4 crc16 mbcache              
[  547.671824]  jbd2 fscrypto sd_mod ahci libahci ehci_pci xhci_pci ehci_hcd libata xhci_hcd crc32c_intel scsi_mod usbcore usb_common sdhci_acpi sdhci serio led_class mmc_core         
[  547.671845] CPU: 1 PID: 475 Comm: Xorg Not tainted 4.14.14-1-ARCH #1
[  547.671847] Hardware name:                  /NUC5i3RYB, BIOS RYBDWi35.86A.0361.2016.1202.1005 12/02/2016
[  547.671849] task: ffff9bcb7fa16740 task.stack: ffffaacfc178c000
[  547.671854] RIP: 0010:mutex_lock+0x10/0x20
[  547.671856] RSP: 0018:ffffaacfc178f8e8 EFLAGS: 00010246
[  547.671859] RAX: 0000000000000000 RBX: 00000000000004ac RCX: ffffaacfc2cffdd8
[  547.671862] RDX: ffff9bcb7fa16740 RSI: 0000000000000287 RDI: 0000000000000320
[  547.671864] RBP: ffff9bcb684ecc00 R08: ffff9bcb96c18d90 R09: 00000000000001ed
[  547.671866] R10: ffffaacfc178f8f0 R11: 00000000000000d4 R12: ffff9bcb729b48a0
[  547.671869] R13: ffff9bcb8e4afa40 R14: 00000000000004ac R15: ffffaacfc178fab0
[  547.671872] FS:  00007fc12387c940(0000) GS:ffff9bcb96c80000(0000) knlGS:0000000000000000
[  547.671875] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  547.671877] CR2: 0000000000000320 CR3: 0000000141dd6004 CR4: 00000000003606e0
[  547.671880] Call Trace:
[  547.671896]  drm_dp_mst_wait_tx_reply+0x140/0x1e0 [drm_kms_helper]
[  547.671903]  ? wait_woken+0x80/0x80
[  547.671912]  drm_dp_mst_i2c_xfer+0x1a0/0x260 [drm_kms_helper]
[  547.671918]  __i2c_transfer+0x120/0x430
[  547.671922]  i2c_transfer+0x51/0xd0
[  547.671944]  drm_do_probe_ddc_edid+0xbc/0x140 [drm]
[  547.671960]  ? drm_rgb_quant_range_selectable+0x100/0x100 [drm]
[  547.671974]  ? drm_do_get_edid+0x61/0x2c0 [drm]
[  547.671986]  ? drm_rgb_quant_range_selectable+0x100/0x100 [drm]
[  547.671998]  drm_do_get_edid+0x61/0x2c0 [drm]
[  547.672011]  drm_get_edid+0x52/0x3d0 [drm]
[  547.672021]  drm_dp_mst_get_edid+0x68/0x80 [drm_kms_helper]
[  547.672066]  intel_dp_mst_get_modes+0x29/0x50 [i915]
[  547.672079]  drm_helper_probe_single_connector_modes+0x5b0/0x770 [drm_kms_helper]
[  547.672095]  drm_mode_getconnector+0x156/0x320 [drm]
[  547.672111]  ? drm_mode_connector_property_set_ioctl+0x60/0x60 [drm]
[  547.672124]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[  547.672137]  drm_ioctl+0x2d5/0x370 [drm]
[  547.672150]  ? drm_mode_connector_property_set_ioctl+0x60/0x60 [drm]
[  547.672156]  do_vfs_ioctl+0xa4/0x630
[  547.672161]  ? __sys_recvmsg+0x4e/0x90
[  547.672164]  ? __sys_recvmsg+0x7d/0x90
[  547.672168]  SyS_ioctl+0x74/0x80
[  547.672173]  entry_SYSCALL_64_fastpath+0x20/0x83
[  547.672176] RIP: 0033:0x7fc121122d27
[  547.672178] RSP: 002b:00007ffc9fc5c828 EFLAGS: 00000246
[  547.672181] Code: 17 a0 ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 be 02 00 00 00 e9 e1 fa ff ff 90 0f 1f 44 00 00 65 48 8b 14 25 00 5c 01 00 31 c0 <f0> 48 0f b1 17 48 85 c0 75 02 f3 c3 eb d2 66 90 0f 1f 44 00 00 
[  547.672233] RIP: mutex_lock+0x10/0x20 RSP: ffffaacfc178f8e8
[  547.672234] CR2: 0000000000000320
[  547.672237] ---[ end trace 1f8e5b72c7c997de ]---
[  547.922211] BUG: unable to handle kernel NULL pointer dereference at 00000000000003a8
[  547.922223] IP: queue_work_on+0x17/0x40
[  547.922225] PGD 0 P4D 0
[  547.922230] Oops: 0002 [#2] PREEMPT SMP PTI
[  547.922233] Modules linked in: cmac md4 xt_nat xt_tcpudp veth xfs ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic loop dm_mod nls_utf8 cifs ccm dns_resolver fscache arc4 nct6775 hwmon_vid iwlmvm snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp mac80211 coretemp nls_iso8859_1 nls_cp437 vfat fat kvm_intel iwlwifi i915 kvm irqbypass iTCO_wdt iTCO_vendor_support i2c_algo_bit drm_kms_helper crct10dif_pclmul evdev crc32_pclmul ghash_clmulni_intel mac_hid pcbc snd_hda_codec_realtek drm cfg80211 snd_hda_codec_generic aesni_intel
[  547.922297]  aes_x86_64 e1000e crypto_simd snd_hda_intel glue_helper cryptd intel_cstate intel_rapl_perf mei_me snd_soc_ssm4567 snd_hda_codec pcspkr snd_soc_rt5640 intel_gtt snd_soc_rl6231 agpgart i2c_i801 tpm_tis shpchp lpc_ich syscopyarea ptp ir_rc6_decoder mei tpm_tis_core sysfillrect snd_hda_core pps_core sysimgblt fb_sys_fops thermal fan tpm btusb snd_hwdep rc_rc6_mce snd_soc_core btrtl ir_lirc_codec lirc_dev btbcm btintel nuvoton_cir battery snd_compress rc_core snd_pcm_dmaengine snd_soc_sst_acpi snd_pcm video snd_soc_sst_match elan_i2c bluetooth snd_timer i2c_hid acpi_als 8250_dw snd kfifo_buf button soundcore industrialio ecdh_generic hid rfkill ac97_bus spi_pxa2xx_platform acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables ext4 crc16 mbcache
[  547.922376]  jbd2 fscrypto sd_mod ahci libahci ehci_pci xhci_pci ehci_hcd libata xhci_hcd crc32c_intel scsi_mod usbcore usb_common sdhci_acpi sdhci serio led_class mmc_core
[  547.922396] CPU: 0 PID: 465 Comm: kworker/u8:8 Tainted: G      D         4.14.14-1-ARCH #1
[  547.922398] Hardware name:                  /NUC5i3RYB, BIOS RYBDWi35.86A.0361.2016.1202.1005 12/02/2016
[  547.922446] Workqueue: i915-dp i915_digport_work_func [i915]
[  547.922450] task: ffff9bcb81ecc9c0 task.stack: ffffaacfc1744000
[  547.922455] RIP: 0010:queue_work_on+0x17/0x40
[  547.922458] RSP: 0018:ffffaacfc1747c60 EFLAGS: 00010002
[  547.922461] RAX: 0000000000000202 RBX: 0000000000000202 RCX: 0000000000000000
[  547.922464] RDX: 00000000000003a8 RSI: ffff9bcb92007000 RDI: 0000000000000080
[  547.922466] RBP: ffffaacfc1747d20 R08: ffffffff853f3c77 R09: 0000000000000001
[  547.922469] R10: ffff9bcb6c741b88 R11: 000000000000000a R12: ffff9bcb8e4af89e
[  547.922471] R13: ffff9bcb86795000 R14: 0000000000000001 R15: ffff9bcb729b48a0
[  547.922475] FS:  0000000000000000(0000) GS:ffff9bcb96c00000(0000) knlGS:0000000000000000
[  547.922477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  547.922480] CR2: 00000000000003a8 CR3: 000000002a00a004 CR4: 00000000003606f0
[  547.922486] Call Trace:
[  547.922501]  drm_dp_mst_handle_up_req+0x4fc/0x5b0 [drm_kms_helper]
[  547.922513]  ? drm_dp_mst_hpd_irq+0x60/0x890 [drm_kms_helper]
[  547.922521]  drm_dp_mst_hpd_irq+0x60/0x890 [drm_kms_helper]
[  547.922566]  ? intel_dp_check_mst_status+0x114/0x1f0 [i915]
[  547.922599]  intel_dp_check_mst_status+0x114/0x1f0 [i915]
[  547.922629]  intel_dp_hpd_pulse+0x19c/0x310 [i915]
[  547.922653]  i915_digport_work_func+0x86/0x110 [i915]
[  547.922658]  process_one_work+0x1e0/0x420
[  547.922661]  worker_thread+0x2b/0x3d0
[  547.922665]  ? process_one_work+0x420/0x420
[  547.922670]  kthread+0x11a/0x130
[  547.922673]  ? kthread_create_on_node+0x70/0x70
[  547.922676]  ret_from_fork+0x32/0x40
[  547.922679] Code: 85 e8 b9 44 04 00 e9 78 ff ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 <f0> 0f ba 2a 00 73 10 31 c9 48 89 df 57 9d 0f 1f 44 00 00 89 c8 
[  547.922717] RIP: queue_work_on+0x17/0x40 RSP: ffffaacfc1747c60
[  547.922719] CR2: 00000000000003a8
[  547.922721] ---[ end trace 1f8e5b72c7c997df ]---
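
For triage, the two oopses in the log above can be summarised with a few lines of Python. This is only a rough sketch: the function name summarize_oopses and the regular expressions are made up for illustration and are matched against the 4.14-style "BUG:" and "IP:" lines shown here, so other kernel versions may need different patterns. Both fault addresses are small (0x320 and 0x3a8) and match the RDI and RDX register values in the respective dumps, which is consistent with a field being accessed through a NULL structure pointer. Which structure that is cannot be determined from the log alone.

```python
import re

# Patterns for the two header lines that open each oops in the log above.
# Assumption: 4.14-style wording; newer kernels format these lines differently.
FAULT_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
IP_RE = re.compile(r"\bIP: (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def summarize_oopses(dmesg_text):
    """Return one (fault_address, function, offset_in_function) tuple per oops."""
    results = []
    fault = None
    for line in dmesg_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            fault = int(m.group(1), 16)
            continue
        # The \b keeps this from matching the later "RIP:" lines.
        m = IP_RE.search(line)
        if m and fault is not None:
            results.append((fault, m.group(1), int(m.group(2), 16)))
            fault = None
    return results


# Header lines copied from the two oopses in this report.
sample = """\
[  547.671668] BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
[  547.671682] IP: mutex_lock+0x10/0x20
[  547.922211] BUG: unable to handle kernel NULL pointer dereference at 00000000000003a8
[  547.922223] IP: queue_work_on+0x17/0x40
"""

for fault, func, off in summarize_oopses(sample):
    print(f"fault at {fault:#x} in {func}+{off:#x}")
```

The second oops is raised from a kworker on the i915-dp workqueue shortly after the first one, so it may share the same underlying cause, but that is a guess.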
Comment 1 Jani Nikula 2018-02-14 13:33:42 UTC
Please add the drm.debug=14 module parameter, reproduce the issue, and attach dmesg covering everything from boot to the problem, if you can.
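
For readers wondering what the value 14 selects: drm.debug is a bitmask, and the sketch below decodes it. The bit names are taken from the DRM_UT_* definitions in kernels of roughly this vintage and should be treated as an assumption; check the DRM headers of the kernel actually in use. The helper name decode_drm_debug is made up for illustration.

```python
# Debug category bits, assumed from the DRM_UT_* definitions around 4.14.
# Later kernels add further bits; verify against your kernel's headers.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by the given mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(decode_drm_debug(14))  # 14 == 0x0e
```

So 14 turns on driver, KMS and PRIME messages while leaving the noisier core messages off. The parameter is typically given on the kernel command line as drm.debug=14 so that logging starts at boot.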
Comment 2 Adam Nielsen 2018-02-14 22:17:54 UTC
Created attachment 137367 [details]
drm.debug=14 dmesg

Here's the dmesg attachment.  I can't get the kernel to crash this time, but the monitors still come up blank (and go into power saving mode) so the relevant times are:

 - 0-85: boot, all monitors working
 - 85: switch monitors off
 - 95-106: switch monitors back on again, HDMI-1 comes back up, DP-1-1 and DP-1-2 remain in power saving mode
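
Since the attachment is fairly large, a small filter can help pull out just the window of interest listed above. This is a minimal sketch, assuming the times quoted are the dmesg timestamps in seconds since boot; the function name lines_between is hypothetical, and the sample lines are placeholders rather than content from the attachment.

```python
import re

# Matches the "[  547.671668]" style timestamp at the start of a dmesg line.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def lines_between(dmesg_text, start, end):
    """Return the lines whose timestamp (seconds since boot) lies in [start, end]."""
    selected = []
    for line in dmesg_text.splitlines():
        m = TS_RE.match(line)
        if m and start <= float(m.group(1)) <= end:
            selected.append(line)
    return selected


# Placeholder lines for illustration only; they are NOT from the attached log.
sample = """\
[   84.000000] placeholder line while monitors are still on
[   95.500000] placeholder line during re-plug
[  106.000000] placeholder line at end of window
[  120.000000] placeholder line after the window
"""

print("\n".join(lines_between(sample, 95, 106)))
```

Lines without a leading timestamp (for example wrapped continuation lines) are skipped by this sketch.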
Comment 3 Jani Saarinen 2018-03-29 07:11:59 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you also do that to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Comment 4 Jani Saarinen 2018-04-25 11:17:53 UTC
Closing; please re-open if the issue still exists.
There have been changes lately on drm-tip for MST.
https://cgit.freedesktop.org/drm-tip
Comment 5 Adam Nielsen 2018-05-27 00:35:03 UTC
Been running this for a while and confirming that yes indeed it seems to be fixed.

There is still an issue whereby if the system activates DPMS suspend while the monitors are off then they don't appear to come back on properly, with "xrandr" initially not showing displays behind the MST hub (but having them show on subsequent runs, while all monitors still remain in DPMS suspend).

The workaround seems to be to bring the displays out of DPMS suspend (press a key or SSH in and run "xset dpms force on"), then power cycle the displays again, after which they all return to normal operating mode.

