Bug 97666 - Kernel "NULL pointer dereference" with MST monitor on HSW
Summary: Kernel "NULL pointer dereference" with MST monitor on HSW
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: Other All
: highest blocker
Assignee: Carlos Santa
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords: bisected, regression
: 98408 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-09-08 19:28 UTC by Kirill A. Shutemov
Modified: 2016-11-22 08:52 UTC (History)
6 users (show)

See Also:
i915 platform: HSW, SKL
i915 features: display/DP MST


Attachments
dmesg with drm.debug=0xe (243.67 KB, text/plain)
2016-09-08 19:28 UTC, Kirill A. Shutemov
no flags Details
xrandr (7.73 KB, text/plain)
2016-09-08 19:28 UTC, Kirill A. Shutemov
no flags Details
Different logs (429.95 KB, application/x-gzip)
2016-10-04 19:43 UTC, Elio
no flags Details
dp mst null pointer deref on broadwell (19.42 KB, text/plain)
2016-10-05 14:45 UTC, Jani Nikula
no flags Details
stack trace (2.65 MB, text/plain)
2016-10-09 23:09 UTC, Carlos Santa
no flags Details
debug patch null pointer (1.67 KB, patch)
2016-10-09 23:09 UTC, Carlos Santa
no flags Details | Splinter Review
dmesg for drm-intel-nightly, de34f4da7f62 (249.23 KB, text/plain)
2016-10-11 22:16 UTC, Kirill A. Shutemov
no flags Details
dmesg for drm-intel-nightly, 41409515b7b3 (305.09 KB, text/plain)
2016-10-11 22:48 UTC, Kirill A. Shutemov
no flags Details
10-21-2016.dmesg stack trace (187.69 KB, text/plain)
2016-10-22 01:46 UTC, Carlos Santa
no flags Details

Description Kirill A. Shutemov 2016-09-08 19:28:03 UTC
Created attachment 126335 [details]
dmesg with drm.debug=0xe

Steps to reproduce:

1. Boot machine (Haswell, also saw on Skylake) with MST monitor (Dell UP2414Q, DP1.2 enabled) attached. It boots fine.

2. Disable the monitor with button on front.

3. Re-enable the monitor. Nothing appears on the screen. It goes to sleep.

4. Switch to linux console (C-A-F1) and back. It brings the monitor back alive.

5. Switch to linux console again. Kernel crashes.
Comment 1 Kirill A. Shutemov 2016-09-08 19:28:20 UTC
Created attachment 126336 [details]
xrandr
Comment 2 Kirill A. Shutemov 2016-09-17 23:19:50 UTC
ping?
Comment 3 Rui Tiago Matos 2016-09-30 15:23:15 UTC
Also hitting this with kernel 4.8.0-0.rc8.git0.1.fc25.x86_64 on Haswell.

Happens just by exiting an X session, i.e. the X server exiting.
Comment 4 Elio 2016-10-04 19:43:44 UTC
Created attachment 127006 [details]
Different logs
Comment 5 Elio 2016-10-04 19:44:21 UTC
Hello all,

I was trying with different Kernels from: 

http://kernel.ubuntu.com/~kernel-ppa/mainline/

Including 4.7, 4.8.0 and 4.8rc8

The problem with 4.7 and beyond, before 4.8 seems to be CPU pipe A FIFO underrun

For 4.8 and beyond, the error is a little bit more specific. Blaming the dp_link_training_clock.

Hardware: Intel Nuc 6i5SYH

In some cases i need an extra monitor because 4k monitor didn't come back from sleep


I tried with different software configuration as well facing same behavior. 

Software configuration:

Component         : drm
	url       : http://cgit.freedesktop.org/mesa/drm
	tag       : libdrm-2.4.68
	commit    : fc09c5ab84240e9b6bd0bed01685ef004f56c4fa
	author    : Kenneth Graunke <kenneth@whitecape.org>
	age       : 4 months ago
	
Component         : mesa
	url       : http://cgit.freedesktop.org/mesa/mesa
	tag       : mesa-12.0.1
	commit    : 04277f058d00238937e664cf546c43b16cea7b2b
	author    : Emil Velikov <emil.velikov@collabora.com>
	age       : 8 weeks ago
	
Component         : xf86-video-intel
	url       : http://cgit.freedesktop.org/xorg/driver/xf86-video-intel
	tag       : 2.99.917-701-g205146b
	commit    : 205146b0fdc8db016e5cfeeae5a6b25df3470ebc
	author    : Chris Wilson <chris@chris-wilson.co.uk>
	

Component         : libva
	url       : http://cgit.freedesktop.org/libva
	tag       : libva-1.7.2.pre1
	commit    : 9927bd2fbbb1d923fd6a8932a3cdbb5c9185ee22
	author    : Xiang Haihao <haihao.xiang@intel.com>
	

Component         : intel-driver
	url       : http://cgit.freedesktop.org/vaapi/intel-driver
	tag       : 1.7.2.pre1
	commit    : ce444fb412966ca6afbb1331b7cae8ab621c1108
	author    : Xiang Haihao <haihao.xiang@intel.com>
	

Component         : cairo
	url       : http://cgit.freedesktop.org/cairo
	tag       : 1.15.2
	commit    : db8a7f1697c49ae4942d2aa49eed52dd73dd9c7a
	author    : Bryce Harrington <bryce@osg.samsung.com>
	

Component         : xserver
	url       : http://cgit.freedesktop.org/xorg/xserver
	tag       : xorg-server-1.18.3
	commit    : 9454cd51da9b38b974cff7c8b7125901f6403848
	author    : Adam Jackson <ajax@redhat.com>
	
Component         : macros
	url       : https://cgit.freedesktop.org/xorg/util/macros
	tag       : util-macros-1.19.0-2-gd7acec2
	commit    : d7acec2
	author    : Andreas Boll <andreas.boll.dev@gmail.com>
	age       : 8 months ago
	

Component         : intel-gpu-tools
	url       : https://cgit.freedesktop.org/xorg/app/intel-gpu-tools
	tag       : intel-gpu-tools-1.16
	commit    : a28e9e38a9efc6daf5a08d60d29adcd3e328fe6f
	author    : Marius Vlad <marius.c.vlad@intel.com>
Comment 6 Jani Nikula 2016-10-05 14:45:02 UTC
Created attachment 127029 [details]
dp mst null pointer deref on broadwell

Reproduced on Broadwell with daisy chained DP MST setup. Reproduced by simply switching off and back on the first display in the chain. Attached dmesg starts from hitting the power button to re-enable display. Today's drm-intel-nightly.
Comment 7 Jim Bride 2016-10-05 21:49:26 UTC
From looking at the dmesg Jani posted, it seems like the NULL pointer dereference is with the topology discovery / management code in the DRM driver.  It's definitely the sideband worker thread that had the crash, if the stack trace is to be trusted.  In Krill's case the crash happens in the frame buffer helpers, but ultimately it's likely the same condition that caused the problem.  Powering down the primary monitor likely also means powering down the primary DP MST branch device in the case of powering down the first monitor in a series of daisy-chained monitors.  I wouldn't necessarily expect any of the downstream monitors to work right in that case, and honestly we probably ought to be able to catch and deal with this case like an unplug event.  Our drivers should handle that case gracefully, however.
Comment 8 Carlos Santa 2016-10-06 01:40:48 UTC
I am trying to get my daisy chain DP MST setup up and running but somehow the second monitor is not turning on, I am using the Dell Monitor which has the support for MST embedded within the monitor itself (not using the hub). Will sync w/ Jim tomorrow on this.

Jim also suggested to manually unplugging the power when connected to the DP MST hub and with that it reproduces a CPU pipe A FIFO underrun (it seems to be a different problem than the null pointer because the kernel didn't crash, only errored out).
Comment 9 Jani Nikula 2016-10-06 10:29:49 UTC
Fails the same way with a DP MST capable monitor also *without* the secondary display.
Comment 10 Carlos Santa 2016-10-07 01:33:31 UTC
I was able to get the daisy chain Monitor setup up and running using both the Dell Monitor with built-in support for MST and the HUB and still I am not able to see the null pointer. I am only able to see the CPU pipe A FIFO underrun as reported by Elio above, but that's a separate issue than the null pointer being reported. I was trying things on SKY, will try on HSW and BDW tomorrow.
Comment 11 Carlos Santa 2016-10-09 23:08:24 UTC
unfortunately the steps to repro from comment #1 are not working for me, so it's making progress on this bug rather slow. I was able to see the null pointer only one time, however the stack trace although similar it wasn't exactly the same (see attachment). Specifically, step #4, where I am supposed to get the screen back is not happening for me, it stays black the entire time with no crash. Step #5 above, where another mode setting occurs I am able to get the screen back but no crash either. (I am on skylake + nightly). I have been trying the above steps consistently 10+ times.

Looking at the stack trace and the problem appears to be starting when drm_probe_ddc() gets called from drm_get_edid() inside drm_edid.c. The adapter as a parameter is being passed but never check for null, specifically the call to drm_do_probe_ddc_edid() derefences some of the fields of adapter struct when the call to i2c_transfer() gets called, most likely causing the crash. I am attaching a debug patch as reference.
Comment 12 Carlos Santa 2016-10-09 23:09:02 UTC
Created attachment 127155 [details]
stack trace
Comment 13 Carlos Santa 2016-10-09 23:09:53 UTC
Created attachment 127156 [details] [review]
debug patch null pointer
Comment 14 Carlos Santa 2016-10-11 02:39:09 UTC
Finally, I am able to consistently repro this problem on HWS/BDW by disabling RC6 power states (i915.enable_rc6=0) at least since the drm-nightly from Oct 1 (http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2016-10-01/

The tip of the drm-nightly at least since Oct 7 doesn't repro the null pointer exception. The display stays black but forcing a mode setting change switching from text mode (CTRL+ ALT + F1) to gfx mode (C+A+F7) "fixes" the black screen and the UI screen is able to come back.   

Next steps:

1. Ask Tomi or someone else to see if they no longer are able to repro the null pointer on the latest drm-nightly.

1. Continue bisecting the tree to see when the null pointer disappear as it seems in the latest nightly I am not able to see the nullpointer anymore.
Comment 15 Jani Nikula 2016-10-11 15:16:36 UTC
Tried today's nightly, works fine. This might be something we could use a reverse bisect for, to backport the fix.

Kiryl, can you try drm-intel-nightly branch of http://cgit.freedesktop.org/drm-intel to see if you can reproduce with that?
Comment 16 Kirill A. Shutemov 2016-10-11 22:14:21 UTC
Nope. The same.
Comment 17 Kirill A. Shutemov 2016-10-11 22:16:01 UTC
Created attachment 127227 [details]
dmesg for drm-intel-nightly, de34f4da7f62
Comment 18 Kirill A. Shutemov 2016-10-11 22:20:08 UTC
Oops. I've compiled the wrong kernel tree. It's Linus' current master.
Comment 19 Kirill A. Shutemov 2016-10-11 22:47:06 UTC
Okay, I don't see the NULL pointer dereference.

But, step 4 doesn't bring the monitor back alive anymore. Still no luck :(
Comment 20 Kirill A. Shutemov 2016-10-11 22:48:43 UTC
Created attachment 127228 [details]
dmesg for drm-intel-nightly, 41409515b7b3

Few attempts of monitor power-cycle, console-and-back, Xorg restart.
Comment 21 Carlos Santa 2016-10-11 22:51:39 UTC
(In reply to Kirill A. Shutemov from comment #19)
> Okay, I don't see the NULL pointer dereference.
> 
> But, step 4 doesn't bring the monitor back alive anymore. Still no luck :(

If you force a modeset switching from console back to graphics you can get the monitor back (CTRL+ALT+F1 -> CTRL+ALT+F7)
Comment 22 Kirill A. Shutemov 2016-10-11 23:01:03 UTC
I tried exactly that. It produces a lot of messages in dmesg, but monitor still says "no signal".
Comment 23 Carlos Santa 2016-10-12 02:48:27 UTC
Using the pre-builts drm-nightly kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/ it looks like the fix for the null pointer entered sometime between Oct 5 and Oct 6. I'll continue the bisection tomorrow.

Bad commit - cod/tip/drm-intel-nightly/2016-10-05 (2e47fbed60d05480cfbd097c1912e6823c0310ee)

Goog commmit - cod/tip/drm-intel-nightly/2016-10-06 (2dff18acaa95a26b882a5f9910d7ded514f18415)

The delta between those two nightly builds:

csanta@graypixels:~/temp2/cod/tip/drm-intel-nightly/2016-10-05$ git log --pretty=oneline cod/tip/drm-intel-nightly/2016-10-06..cod/tip/drm-intel-nightly/2016-10-04
35ceb7ced7dc45fe6db94e36e3d36089557994f4 drm-intel-nightly: 2016y-10m-03d-20h-05m-29s UTC integration manifest
c649d02d17ff44a674531e59f62d789f020ab3df Merge remote-tracking branch 'origin/topic/core-for-CI' into drm-intel-nightly
8978146d3e9d0f82c375772e0afd437a7e14c40a Merge remote-tracking branch 'origin/topic/drm-misc' into drm-intel-nightly
e274358357f992003d3602c7bffccba5e65efbb6 Merge remote-tracking branch 'sound-upstream/for-next' into drm-intel-nightly
135b00745346d3b9b61c5fcde114c982a78a1401 Merge remote-tracking branch 'drm-upstream/drm-next' into drm-intel-nightly
50d07cc835486b055c320884e4c72e3d6e8b324c Merge remote-tracking branch 'origin/drm-intel-next-queued' into drm-intel-nightly
262f75a6288346db44038c63b7a95ee68f8b7bea drm: Undo damage to page_flip_ioctl
0c238e65d7dc62c5cae138cec720fec7baa07a24 drm/rockchip: analogix_dp: Refuse to enable PSR if panel doesn't support it
a4cb6284e28b90761145f3ffa454a515ac4644ba drm/bridge: analogix_dp: Add analogix_dp_psr_supported
6c4d6f9f997c5dafccb54b52167f0c4d0ea37874 drm/fb-helper: add DRM_FB_HELPER_DEFAULT_OPS for fb_ops
e80559f71b0c35a099532c583e251f4faf572c95 drm: Document caveats around atomic event handling
5bc0fc9a545641b19bdfa32911fd755b41b428ab drm/i915: Fix build when !CONFIG_DEBUG_FS
26e19227e2d405af48ef53d421a3c5bc4b499a3d uapi: add missing install of sync_file.h
844f9ee731472c7c8ce99851e7eafc98b8190293 drm: Simplify drm_printk to reduce object size quite a bit
5a86d6fbff2e403ab9eb86d0997ee0f2435258bc drm/i915: Account for sink max TMDS clock when checking the port clock
3be453e6cddf6eeca1cf69cdff750a31be011702 drm/i915: Replace a bunch of connector->base.display_info with a local variable
462de098327ebcd63a09cbdf22df36efc1266bf9 drm/edid: Move dvi_dual/max_tmds_clock parsing out from drm_edid_to_eld()
52655e01952a944fb889be7e56781572507148ad drm/edid: Clear the old cea_rev when there's no CEA extension in the new EDID
dc4b7a0aa6d523acd8331f9590ab42d5348cc307 drm/edid: Reduce the number of times we parse the CEA extension block
b5ec14f5eb6193a46e0d7854da70123504c4ab36 drm/edid: Don't pass around drm_display_info needlessly
21da17eaa02bdf727a1e3141bdd603f2f79df407 drm/edid: Move dvi_dual/max_tmds_clock to drm_display_info
84aa822007328e00386bde83760499ab146157fc drm/edid: Make max_tmds_clock kHz instead of MHz
66f2d60f872f242c578263324e848479f6b7b786 drm/edid: Clear old dvi_dual/max_tmds_clock before parsing the new EDID
1be05eabb2ae5af798f09300fc13ed008ff13c8a drm/edid: Clear old audio latency values before parsing the new EDID
da2bf7e805921494df5ebe18e99c790b1fbb450c drm: Convert prime dma-buf <-> handle to rbtree
0482bae6012924aa01b06e33431a57b9e971888c drm: Restore lost drm_framebuffer_unreference in drm_mode_page_flip_ioctl
24454c2c4b4bfac5ea8b1d2ec6a61ea20abea6ab drm: Add frame CRC debugfs files only for drivers that have CRTC
ea568478ad07e6dcf90fc02265760f703b50fb3d drm/i915: Put "cooked" vlank counters in frame CRC lines
13fa0253d97a781bd52e836439827916a21ff2af drm/i915: Use new CRC debugfs API
48c787899882ec9c3e111a81801c82d10da17810 drm: Add API for capturing frame CRCs
21165bd933acd4b9d5bb81cc15cfa1fc1258ddc5 drm/i915/debugfs: Move out pipe CRC code
849fccd258ec702381328c0a52442d268ff96082 Merge remote-tracking branch 'airlied/drm-next' into topic/drm-misc
bd233af436baccc9c600902b1726bcd0b6ada63d drm/sti: mark symbols static where possible
490896bc2382b0ed77b78d6242103706b4d06fe4 drm/mediatek: mark symbols static where possible
492ec801668a48db7751a8cf591a19c5f79ca57a drm/rockchip: mark symbols static where possible
4e6ee0192b2d66083b56a3e0b863d556e642ffa7 drm/rockchip: add missing header dependencies
Comment 24 Carlos Santa 2016-10-13 01:50:19 UTC
To make the bisect more simpler I moved the bisection to the drm-intel tree instead of the Ubuntu one.

Using the following two commits as starting point for the bisect:

Bad commit:  a4fce9cb782ad340ee5576a38e934e5e75832dc6
Good commit: 14740bb25ec36fe4ce8042af3eb48aeb45e5bc13

the commit that introduced the null pointer seems to be this one:

27d4efc5591a5853de54713bc717de73c8951e17 is the first bad commit
commit 27d4efc5591a5853de54713bc717de73c8951e17
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Mon Oct 3 10:55:15 2016 +0300

    drm/i915: Move long hpd handling into the hotplug work
    
    We can't rely on connector->status in the detect() hook if the long hpd
    was already handled by the dig_port_work as that won't update
    connector->status. Thus we have to defer the long hpd handling entirely
    until the hotplug work runs to avoid the double long hpd handling
    the "detect_done" flag is trying to prevent.
    
    We'll start to depend on connector->status being up to date in a
    following patch.
    
    Cc: Damien Cassou <damien@cassou.me>
    Cc: freedesktop.org@gp.mailgun.org
    Cc: Arno <blouin.arno@gmail.com>
    Cc: Shubhangi Shrivastava <shubhangi.shrivastava@intel.com>
    Cc: Sivakumar Thulasimani <sivakumar.thulasimani@intel.com>
    Cc: Ander Conselvan de Oliveira <conselvan2@gmail.com>
    Cc: stable@vger.kernel.org
    Tested-by: Arno <blouin.arno@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=83348
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1475481316-8194-1-git-send-email-ville.syrjala@linux.intel.com
    Reviewed-by: Ander Conselvan de Oliveira <conselvan2@gmail.com>


which makes sense since there are changes related to display port. I'll continue tomorrow looking into the changes.
Comment 25 Carlos Santa 2016-10-14 04:00:04 UTC
Minor correction, the above commit from the bisect is the actual fix to the null pointer/black screen issue.

commit 27d4efc5591a5853de54713bc717de73c8951e17
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Mon Oct 3 10:55:15 2016 +0300

    drm/i915: Move long hpd handling into the hotplug work

@Kirill/Jani can you please verify this at your end? 

Reverting this commit from today's nightly as well as the before/after of Oct 3 by git hard --reset 27d4efc5591a I obtain the same result.
Comment 26 Kirill A. Shutemov 2016-10-16 23:42:03 UTC
[The commit is v4.9-rc1 as 1015811609c0.]

Yes, I don't see the NULL pointer dereference anymore with the commit.

But exactly the same commit also breaks the workaround I use to wake up the screen (switch-to-console-and-back). I don't have anymore a mean to make display work again after it went to sleep. :(

May be that's a different bug, I don't know.
Comment 27 Ville Syrjala 2016-10-17 12:19:11 UTC
(In reply to Kirill A. Shutemov from comment #26)
> [The commit is v4.9-rc1 as 1015811609c0.]
> 
> Yes, I don't see the NULL pointer dereference anymore with the commit.
> 
> But exactly the same commit also breaks the workaround I use to wake up the
> screen (switch-to-console-and-back). I don't have anymore a mean to make
> display work again after it went to sleep. :(
> 
> May be that's a different bug, I don't know.

Can you grab some dmegs logs w/ drm.debug=0xe from that case?
Comment 28 Carlos Santa 2016-10-17 16:53:47 UTC
@Vylle, Kirill,

The null pointer was indirectly fix with the above commit (27d4efc5591a5853de54713bc717de73c8951e17).

Can we close this bug and open a new with the new problem you're referring to? Otherwise, this bugzilla will become too polluted.

Carlos
Comment 29 Kirill A. Shutemov 2016-10-18 09:18:16 UTC
Okay, I'll open new bug.
Comment 30 Ville Syrjala 2016-10-20 11:00:51 UTC
Reopening since the "fix" was just due to another regression, and once that's fixed the oops back.
Comment 31 Ville Syrjala 2016-10-20 11:31:32 UTC
I pushed a few potential patches to
git://github.com/vsyrjala/linux.git dp_mst_fixes

Totally untested as of now.
Comment 32 Carlos Santa 2016-10-20 23:48:59 UTC
Hi Ville,

I tested the following 3 patches on today's drm-nightly:

1. drm/i915: Refresh that status of MST capable connectors in ->detect()
2. drm/fb-helper: Fix connector ref leak on error
3. drm/fb-helper: Keep references for the current set of used connectors

You're right, after applying #1, the null pointer issue comes back but after applying #2, and doing the display power on/off to repro the bug, the system freezes and never comes back again. Same behavior if then applying patch #3 on top, which is supposed to be the fix.

Carlos
Comment 33 Ville Syrjala 2016-10-21 09:25:53 UTC
(In reply to Carlos Santa from comment #32)
> Hi Ville,
> 
> I tested the following 3 patches on today's drm-nightly:
> 
> 1. drm/i915: Refresh that status of MST capable connectors in ->detect()
> 2. drm/fb-helper: Fix connector ref leak on error
> 3. drm/fb-helper: Keep references for the current set of used connectors
> 
> You're right, after applying #1, the null pointer issue comes back but after
> applying #2, and doing the display power on/off to repro the bug, the system
> freezes and never comes back again. Same behavior if then applying patch #3
> on top, which is supposed to be the fix.

Freezes hard enough to not leave a trace in netconsole/serial etc. ? Logs would be nice.
Comment 34 Carlos Santa 2016-10-22 01:45:31 UTC
(In reply to Ville Syrjala from comment #33)
> (In reply to Carlos Santa from comment #32)
> > Hi Ville,
> > 
> > I tested the following 3 patches on today's drm-nightly:
> > 
> > 1. drm/i915: Refresh that status of MST capable connectors in ->detect()
> > 2. drm/fb-helper: Fix connector ref leak on error
> > 3. drm/fb-helper: Keep references for the current set of used connectors
> > 
> > You're right, after applying #1, the null pointer issue comes back but after
> > applying #2, and doing the display power on/off to repro the bug, the system
> > freezes and never comes back again. Same behavior if then applying patch #3
> > on top, which is supposed to be the fix.
> 
> Freezes hard enough to not leave a trace in netconsole/serial etc. ? Logs
> would be nice.

Ok, I got the log using dmesg -w. It's actually oopsing the same way as before, see the attachment 10 [details]-21-2016.

More on the traces,

the "adapter" parameter being passed to i2c_transfer() from drm_do_probe_ddc_edid()  although is not NULL it appears to arrive with invalid data, the pointer that gets de-referenced is this one "adapter->algo->master_xfer".

Adding more traces I was able to track the sequence to:

i2c_transfer()
drm_do_probe_ddc_edid()
drm_get_edid()
port->connector = (*mstb->mgr->cbs->add_connector)(mstb->mgr, port, proppath);
drm_dp_add_port(port->connector, &port->aux.ddc)
drm_dp_send_link_address()

The port->connector is what gets passed to drm_get_edid() as a reference, I am guessing that this "port" somehow is becoming invalid from within intel_dp_add_mst_connector().
Comment 35 Carlos Santa 2016-10-22 01:46:55 UTC
Created attachment 127461 [details]
10-21-2016.dmesg stack trace
Comment 36 Ville Syrjala 2016-10-24 12:17:35 UTC
*** Bug 98408 has been marked as a duplicate of this bug. ***
Comment 37 Ville Syrjala 2016-10-24 12:37:42 UTC
(In reply to Carlos Santa from comment #35)
> Created attachment 127461 [details]
> 10-21-2016.dmesg stack trace

Pushed another little thing to dp_mst_fixes
3852841d51ba ("drm/dp/mst: Clear port->pdt when tearing down the i2c adapter")

I don't particularly like how that code is structured, so can't say with any certainty that this would help, but let's try anyway.
Comment 38 Carlos Santa 2016-10-25 01:51:02 UTC
(In reply to Ville Syrjala from comment #37)
> (In reply to Carlos Santa from comment #35)
> > Created attachment 127461 [details]
> > 10-21-2016.dmesg stack trace
> 
> Pushed another little thing to dp_mst_fixes
> 3852841d51ba ("drm/dp/mst: Clear port->pdt when tearing down the i2c
> adapter")
> 
> I don't particularly like how that code is structured, so can't say with any
> certainty that this would help, but let's try anyway.

With this last patch, it still fails the same way as before, no changes to the values inside the struct that defines the port->connector, it seems it's corrupted as it's no NULL but not valid at the same time.

In my debugging I added one simple check here,

@@ -1162,7 +1164,8 @@ static void drm_dp_add_port(struct drm_dp_mst_branch *mstb,
                        drm_dp_put_port(port);
                        goto out;
                }
-               if (port->port_num >= DP_MST_LOGICAL_PORT_0) {
+               printk("************* Carlos %s %d port_num: %d port_status: %d\n",__FUNCTION__,__LINE__,port->port_num, port->conne
+               if (port->port_num >= DP_MST_LOGICAL_PORT_0 && port->connector->status != connector_status_unknown) {
                        port->cached_edid = drm_get_edid(port->connector, &port->aux.ddc);
                        drm_mode_connector_set_tile_property(port->connector);
                }

where I skip calling drm_get_edid() if the status of the port becomes "unknown",and with that the crash is avoided ( I still have to force the modeset with CRTL+F1, CRTL+F7). This shows that the struct port->connector is certainly not NULL yet it seems to arrive to this point with invalid data, perhaps corrupted?

Another observation, one has to press the On/Off button for this to become a problem, if I leave the system to suspend and then wake up the monitor with the mouse, then I don't see the crash, so maybe it's a cleanup problem during the off button sequence???
Comment 39 Ville Syrjala 2016-10-25 14:37:15 UTC
Pushed another kludge:
1547e21b7ebf ("drm/dp/mst: Check peer device type before attempting EDID read")

Let's see if that helps at all.
Comment 40 Carlos Santa 2016-10-25 21:55:54 UTC
Yeah, that does help avoid the crash, the port->pdt comes as 0x0 which is equal to DP_PEER_DEVICE_NONE.

The screen is still black after pushing the ON button but to get the display back I force a modeset with CTRL+F1 and CTRL+7.
Comment 41 Kirill A. Shutemov 2016-10-25 23:19:35 UTC
The same for me: it survived few power cycles. And it still requires console-and-back to get screen on. Is it a separate bug?
Comment 42 Ville Syrjala 2016-10-26 07:56:45 UTC
(In reply to Kirill A. Shutemov from comment #41)
> The same for me: it survived few power cycles. And it still requires
> console-and-back to get screen on. Is it a separate bug?

I haven't really thought it through, but I think what might be happening is that the fbdev hotplug processing never happens in this case. Presumably we should actually do it at the restore_fbdev time since the hotplug happened while fbdev was not in charge of things and so couldn't respond to the hotplug at the time when it actually happened. I guess we could call it a new bug, just so the bug report won't be full of oops reports. But I think we may want to hold off opening a new bug until/if these current patches are accepted. I'll go send them out now.
Comment 43 Carlos Santa 2016-10-29 01:17:50 UTC
Marking this as Resolved, as this patch fixes the crash with the null pointer

1547e21b7ebf ("drm/dp/mst: Check peer device type before attempting EDID read")

Ville has submitted it to the mailing list and it's currently under review.
Comment 44 Jani Nikula 2016-10-29 12:08:02 UTC
(In reply to Carlos Santa from comment #43)
> Marking this as Resolved, as this patch fixes the crash with the null pointer
> 
> 1547e21b7ebf ("drm/dp/mst: Check peer device type before attempting EDID
> read")
> 
> Ville has submitted it to the mailing list and it's currently under review.

Please do not close bugs before the patches have landed in drm-intel repo.
Comment 45 Jani Nikula 2016-11-07 20:51:50 UTC
commit 4da5caa6a6f82cda3193bca855235b87debf78bd
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Wed Oct 26 12:05:55 2016 +0300

    drm/dp/mst: Check peer device type before attempting EDID read
Comment 46 yann 2016-11-08 08:14:00 UTC
(In reply to Jani Nikula from comment #45)
> commit 4da5caa6a6f82cda3193bca855235b87debf78bd
> Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Date:   Wed Oct 26 12:05:55 2016 +0300
> 
>     drm/dp/mst: Check peer device type before attempting EDID read

closing as fixed
Comment 47 Kirill A. Shutemov 2016-11-14 01:34:59 UTC
It looks like got broken on the way upstream. Or some other bug was introduced in the same batch of patches.

dp_mst_fixes_4 from Ville's repo works fine, but upstream 4da5caa6a6f8 doesn't.
Comment 48 Carlos Santa 2016-11-16 21:37:11 UTC
@Kirill, can you be more specific about the issue that you're seeing that caused to reopen this bug?

On yesterday's drm-nightly I am not able to see the NULL pointer crash after pressing the On/Off button in the MST setup I have. This was easily reproduced before. Are you still seeing the NULL pointer or something else?
Comment 49 Ville Syrjala 2016-11-17 17:55:10 UTC
(In reply to Kirill A. Shutemov from comment #47)
> It looks like got broken on the way upstream. Or some other bug was
> introduced in the same batch of patches.
> 
> dp_mst_fixes_4 from Ville's repo works fine, but upstream 4da5caa6a6f8
> doesn't.

Still oops or just fail to light up the display?

https://patchwork.freedesktop.org/patch/117191/
hasn't made its way into Linus's tree yet. Which can explain things not lighting up.
Comment 50 Kirill A. Shutemov 2016-11-17 18:33:55 UTC
Nothing lighting up. I'll check the patch.
Comment 51 Kirill A. Shutemov 2016-11-17 23:17:47 UTC
Yes, the patch helps.
Comment 52 Jani Nikula 2016-11-18 09:31:49 UTC
fc22b787890f ("drm/i915: Refresh that status of MST capable connectors in ->detect()") is included in Dave's pull request to Linus today:

http://lkml.kernel.org/r/CAPM=9tw1YEq-FGCc7Wm=BBQc0jShEECM6_uVQwZ-BpOAsOGB=A@mail.gmail.com
Comment 53 Kirill A. Shutemov 2016-11-22 08:10:39 UTC
Linus' tree is fine now.
Comment 54 Jani Nikula 2016-11-22 08:52:24 UTC
Thanks for the follow-up, and your patience, Kirill!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.