106250 – Regression with Dell TB16 dock and Linux kernel 4.16.x

Summary: Regression with Dell TB16 dock and Linux kernel 4.16.x

Status:	CLOSED NOTOURBUG

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high major
Assignee:	Stanislav Lisovskiy
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Duplicates (1):	109059
Depends on:
Blocks:

Reported:	2018-04-26 10:57 UTC by Patrik Flykt
Modified:	2019-03-13 07:13 UTC
CC List:	5 users

See Also:
i915 platform:	KBL
i915 features:	display/USB-C

Attachments
Case for not working external display (5.85 KB, text/plain) 2018-08-06 07:53 UTC, Stanislav Lisovskiy
Case for working external display (6.73 KB, text/plain) 2018-08-06 07:54 UTC, Stanislav Lisovskiy
Temporary fix (2.59 KB, patch) 2018-11-16 08:56 UTC, Stanislav Lisovskiy
Userspace fix for Intel DDX (xf86-video-intel) (3.08 KB, patch) 2018-11-19 16:31 UTC, Stanislav Lisovskiy
Kernel fix (2.31 KB, patch) 2018-11-19 16:33 UTC, Stanislav Lisovskiy

Description Patrik Flykt 2018-04-26 10:57:11 UTC

A Dell XPS13 connected via Thunderbolt to a Dell TB16 docking station, as described in bug #103645, shows a regression when running a 4.16.[1234] kernel. The symptom is that a monitor connected to the DP output of the TB16 stays black and in power-save mode after the docking station is connected. A keyboard connected via the docking station still works, so Thunderbolt itself is working fine. The patch from bug #104425 was tried, but it made no difference to the DP-connected monitor.

Kernel 4.15.x works just fine (or as fine as it can, taking bug #103645 into account) with the DP monitor connected.

Comment 1 Jani Saarinen 2018-04-27 06:33:34 UTC

Are there any further logs you can provide here?

Comment 2 Jani Saarinen 2018-05-02 06:49:11 UTC

Ping. Or are all the relevant ones in the other bug?

Comment 3 Stanislav Lisovskiy 2018-06-04 10:29:35 UTC

Did you try to unplug and plug again? I think I'm observing something similar (black screen on both integrated and external displays) with my Dell docking station as well; however, it happens only on the login screen and can be cured by replugging the USB-C cable. I just want to check whether this is the same issue. If it is, I can debug it with my own laptop.

Comment 4 Patrik Flykt 2018-06-04 10:37:48 UTC

I have been plugging the USB-C Thunderbolt cable in and out a few times, and at least the DP-connected monitor comes up all black/power save, and usually the HDMI one as well.

Comment 5 Stanislav Lisovskiy 2018-06-04 12:35:43 UTC

I could reproduce the problem with your docking station, and it is slightly different from what I had with mine. For me the USB driver stack is definitely involved, because I actually get some errors from usbhid and the external keyboard/mouse occasionally do not work.

Comment 6 Stanislav Lisovskiy 2018-06-05 10:53:45 UTC

I think I could fix the issues mentioned above, at least on my machine.
The problem actually consists of two issues. The first one is related to Thunderbolt docking station initialization: it doesn't seem to work properly if Thunderbolt is built as a separate module (periodically the mouse and keyboard, and also the display, were disconnecting). After I started using Thunderbolt built into the kernel it got significantly better.
The second one, i.e. the black screen, seems to be caused by wrong watermark calculations. After adding drm.debug I could see multiple error messages from the skl_compute_wm function saying "requested configuration exceeds maximum wm level". To check this hypothesis, I've implemented a small hack, so that if the required res_blocks turns out to be >= ddb_allocation, res_blocks is just assigned the maximum ddb_allocation value.
After those changes I don't get a black screen anymore, even after connecting/disconnecting the Thunderbolt USB-C cable multiple times.

However some USB peripherals might still stop working after that, which I believe is a USB hub related problem.

Comment 7 Stanislav Lisovskiy 2018-06-07 14:23:50 UTC

Minor update: it looks like the watermarks go outside ddb_allocation when the framebuffer is in Y Tiled mode. First of all, drm_framebuffer_init complains with "no Y Tiling for legacy addfb"; however, it seems to continue using Y Tiling when computing watermarks, as wp->y_tiled is set to true. This makes it go outside ddb_allocation for the 3840x2560 resolution used with eDP, which leads to a black screen.
I could fix it either by forcing X Tiling instead of Y Tiling or just by assigning res_blocks the current maximum ddb_allocation value.

Which way is correct here, I still have to understand.

Comment 8 Stanislav Lisovskiy 2018-06-08 07:26:08 UTC

After discussing with Ville, it looks to me like there might be an issue with the watermark calculation algorithm (I will check against bspec whether there is a mistake), and there is also an issue with drm_mode_addfb2: userspace attempts to use Y Tiling for the buffer object without using explicit FB modifiers, which leads to an error. It seems to fall back to X Tiling, which doesn't exceed the watermarks, only once we replug the display.

Comment 9 Mahesh Kumar 2018-06-11 08:31:38 UTC

If there are any logs available, I can provide a more accurate analysis. My analysis so far is:
On Gen9 we have enabled a few display WAs which almost double the WM requirement for Y-tiling. This may result in the WM requirement being more than the available DDB.

Assigning (res_blocks = ddb_allocation - 1) is not the right solution, as with that we'll violate the HW watermark requirement (it may work for some scenarios, but it is not really the right solution).
If failing during sanitize_watermarks makes userspace fall back to X-tiling, then that's the right solution.
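
The difference between the two options can be sketched in a few lines. This is only a toy model in Python, not kernel code: res_blocks and ddb_allocation are the names used in this thread, while the function names and the numbers are hypothetical.

```python
def clamp_res_blocks(res_blocks: int, ddb_allocation: int) -> int:
    """The hack: force the value to fit, even if the hardware needs more."""
    return min(res_blocks, ddb_allocation - 1)


def level_fits(res_blocks: int, ddb_allocation: int) -> bool:
    """The alternative: report that the level does not fit, so the caller
    can fail the configuration and userspace can try something else."""
    return res_blocks < ddb_allocation


# Hypothetical numbers: the requirement is larger than the allocation.
required, allocation = 900, 512
print(clamp_res_blocks(required, allocation))  # programs fewer blocks than required
print(level_fits(required, allocation))        # rejects the configuration instead
```

Clamping hides the shortfall from userspace, whereas rejecting makes it visible so that a fallback such as X-tiling can be attempted.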

-Mahesh

Comment 10 Stanislav Lisovskiy 2018-07-19 13:55:55 UTC

After discussion it was decided that I need to write a summary in order to escalate this issue further. The problem is that on some architectures Y Tiled mode goes beyond the available DDB resources, especially when used with multiple 4K displays.
Currently userspace has no way or procedure to determine and fall back to a proper, or at least working, display mode, so in some cases drm_mode_setcrtc fails, as in this bug, and we get a black screen.
It was proposed that we might need to fix this in Mesa, so that it is able to determine whether the display mode fits the WM requirements before it attempts a modeset. One heuristic way could be to probe whether the ddb_allocation can fit at least twice the required buffer (the most common scenario) in order to determine whether Y Tiled mode is usable, and otherwise switch to X Tiled mode. So fixing this might require changes in both user and kernel space. In kernel space, an ioctl to query the minimum required ddb_allocation might then be needed.
There is also a proposal to develop a corresponding stress test case in IGT, in order to be able to detect similar situations.
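
A minimal sketch of the proposed heuristic, under one reading of "fit at least twice the required buffer", namely ddb_allocation >= 2 * required_blocks. No such query ioctl exists in this thread, so both inputs are assumed to be known already; the function name and the numbers are hypothetical.

```python
def pick_tiling(ddb_allocation: int, required_blocks: int) -> str:
    """Return "Y" if the allocation holds at least twice the required
    blocks, otherwise fall back to "X" (assumed interpretation)."""
    if ddb_allocation < 0 or required_blocks < 0:
        raise ValueError("block counts must not be negative")
    return "Y" if ddb_allocation >= 2 * required_blocks else "X"


# Hypothetical numbers for two planes sharing a 512-block allocation.
print(pick_tiling(512, 200))  # enough headroom for Y tiling
print(pick_tiling(512, 300))  # not enough, so X tiling is chosen
```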

Comment 11 Patrik Flykt 2018-07-20 15:11:55 UTC

Yes, please fix this properly even if it takes more time. Meanwhile, does upgrading to a newer kernel version help with the problem in any way? I'm running upstream Linus kernels, so upgrading a kernel is not a problem. With Mesa and other userland I'm more stuck with what is available from distros, e.g. Debian testing.

Comment 12 Stanislav Lisovskiy 2018-08-06 07:53:06 UTC

I've discovered one more "funny" issue, which Patrik was probably also facing:
every second time Thunderbolt is disconnected and then connected back, the external display doesn't work.
I've checked the logs and added my own traces. It seems that the kernel (4.18-rc7) sends the hotplug event as it should; however, every second time we don't get drm_mode_setcrtc for PIPE_B from userspace. To me it looks like a userspace issue, as the kernel seems to react properly.
I've attached the corresponding logs (not_working_pipe_b for the non-working case and working_pipe_b for the working case).
It mostly alternates on every second attempt, which indicates some logic problem. The only difference is that we simply don't get drm_mode_setcrtc for pipe B, even though the hotplug event was sent through sysfs, which was verified by additional traces (see the logs).

Comment 13 Stanislav Lisovskiy 2018-08-06 07:53:47 UTC

Created attachment 140970
Case for not working external display

Comment 14 Stanislav Lisovskiy 2018-08-06 07:54:29 UTC

Created attachment 140971
Case for working external display

Comment 15 Stanislav Lisovskiy 2018-08-28 08:38:21 UTC

It has been confirmed that the corresponding uevent is delivered to userspace, by observing the corresponding traces in the DDX: sna_handle_events gets the uevent, which is then analyzed by sna_mode_discover, which in turn sends the RandR notifications (RRTellChanged, RRCrtcNotify) to the X server clients. In the non-buggy case we get a ProcRRSetScreenConfig message from the desktop clients in response, which then triggers a modeset (drm_mode_setcrtc).
However, for some reason this doesn't happen when the bug occurs, even though the DRM connector state is the same as in the non-buggy case.

After discussion with Martin, we decided that this bug can be closed as non-fixable, at least from the kernel side, and a corresponding bug filed for gnome-desktop/Ubuntu.

Comment 16 Patrik Flykt 2018-08-28 15:10:40 UTC

(In reply to Stanislav Lisovskiy from comment #15)
> After discussion with Martin, made a decision that this bug can be closed as
> non-fixable at least from kernel side and correspondent bug filed for
> gnome-desktop/Ubuntu.

Sounds like a plan. Please add a link to the bug for gnome-desktop/Ubuntu/etc. here, and remember to add all proper adjectives when describing it so that it will get fixed rather sooner than later. It would be great if someone also could keep an eye on the bug and report back here in which version of what desktop component it got fixed so that updated distro versions would be easier to track.

Comment 17 Stanislav Lisovskiy 2018-08-29 13:12:18 UTC

The only thing I can recommend so far, at least as a simple workaround, is that once this happens you can execute xrandr --output <output name> --crtc <some crtc number>. The output name can be found by executing xrandr without arguments. As for the crtc id, I just tried different ones until the secondary screen started working properly. This way you can manually restore the secondary screen to a proper state (the command basically just sends the ProcRRSetScreenConfig request, which we are lacking from the desktop manager).
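
For anyone who wants to script this workaround, here is a small sketch that only wraps the xrandr invocation described above. The helper names are hypothetical, and the output name and crtc number in the usage comment are placeholders that have to be replaced with values from your own xrandr output.

```python
import subprocess
from typing import List


def build_xrandr_command(output: str, crtc: int) -> List[str]:
    """Build the argument list for: xrandr --output <output> --crtc <crtc>."""
    if not output:
        raise ValueError("output name must not be empty")
    if crtc < 0:
        raise ValueError("crtc number must not be negative")
    return ["xrandr", "--output", output, "--crtc", str(crtc)]


def force_modeset(output: str, crtc: int) -> bool:
    """Run xrandr and report whether it exited successfully."""
    result = subprocess.run(build_xrandr_command(output, crtc), check=False)
    return result.returncode == 0


# Example (placeholder values, try other crtc numbers if needed):
#     force_modeset("DP-1-2", 1)
```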

Comment 18 Lakshmi 2018-09-10 07:07:35 UTC

Patrik, can you try Stanislav's instructions above?

Comment 19 Patrik Flykt 2018-09-10 14:04:45 UTC

I can try it as soon as I get my dock back from Stanislav. Meanwhile, what is the bug filed for userland, so that those interested can follow it?

Comment 20 Stanislav Lisovskiy 2018-09-10 14:14:32 UTC

To be honest, I haven't filed a bug yet. I think I still need some time to figure out where exactly it should be filed.

Comment 21 Benjamin Berg 2018-11-15 14:22:26 UTC

So, I think I had a user here who was running into this bug (on F29). Specifically, what happens for him is the following:

 * Boot with dock plugged
-> External monitor is on connector DP-1-2
 * Unplug dock
 * Plug dock back in
-> External monitor is on connector DP-5 (according to sysfs)

After this xrandr still reports the display on connector DP-1-2 though. The first time the display could not be configured at this point, the second time it worked despite the inconsistency.
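
To compare the kernel's view with what xrandr prints, the connector states can be read from sysfs. The sketch below assumes the usual /sys/class/drm/cardN-<connector>/status layout; the function name is hypothetical and the root directory is a parameter so that it can be pointed at a test tree.

```python
from pathlib import Path
from typing import Dict


def read_connector_status(drm_root: str = "/sys/class/drm") -> Dict[str, str]:
    """Map connector names (e.g. "DP-5") to the content of their status file."""
    statuses: Dict[str, str] = {}
    for entry in sorted(Path(drm_root).glob("card*-*")):
        status_file = entry / "status"
        if not status_file.is_file():
            continue  # not a connector directory
        # "card0-DP-5" -> "DP-5"
        name = entry.name.split("-", 1)[1]
        statuses[name] = status_file.read_text().strip()
    return statuses


if __name__ == "__main__":
    for name, status in read_connector_status().items():
        print(f"{name}: {status}")
```

Any connector that is listed as connected here but missing or disconnected in the xrandr output points at the inconsistency described above.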

Comment 22 Stanislav Lisovskiy 2018-11-15 14:59:27 UTC

I've located the problem. It was in the xserver/ddx code, which at some point thinks the crtc hasn't changed while it actually has. I've made a rough patch for xserver and it works for me, but I haven't sent it yet, because it basically just removes the part of the crtc checking code that prevents the drm_mode_setcrtc ioctl from being issued.

As a temporary workaround you can try xrandr --output (the name of your non-working output) --crtc (some number here). This will force a drm_mode_setcrtc call to be made and the screen will be back.
I can also post my temporary fix here; however, it requires rebuilding xserver from scratch and then installing it.

Comment 23 Stanislav Lisovskiy 2018-11-16 08:56:09 UTC

Created attachment 142488
Temporary fix

Attaching a temporary fix for the xserver here, in case somebody wants to get this fixed before I figure out the correct way to fix it and send the patch upstream.

Comment 24 Patrik Flykt 2018-11-16 15:38:55 UTC

So that fix relates to X. Is there something similar to be done for Wayland? I should perhaps re-test all this, but right now I have only one DP monitor connected to the Dell dock.

Comment 25 Stanislav Lisovskiy 2018-11-19 16:30:49 UTC

I think I've identified the real reason for this bug: the kernel dynamically allocates a new connector id for DP MST devices each time one is plugged or unplugged, adding and removing the corresponding connectors. That seems to confuse userspace into thinking that the connector is still in a connected state, thus leading to a lost modeset.
In order to fix that, we must either return only active DP MST connectors or check connector states more carefully on the userspace side.
I've implemented both fixes; however, I'm not sure which one is correct, and some things still need to be understood.

However, it would be helpful if somebody tried those and reported whether they fix the problem. The userspace fix is implemented for the Intel DDX; I guess something similar might be needed for XWayland or modesetting.
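
To illustrate the "check connector states more carefully" idea, here is a toy model in Python. It is not taken from either attached patch: the Connector type, the function name and the connector ids are hypothetical, and only the connector names come from comment 21.

```python
from typing import Dict, List, NamedTuple


class Connector(NamedTuple):
    name: str
    connected: bool


def connectors_needing_modeset(before: Dict[int, Connector],
                               after: Dict[int, Connector]) -> List[int]:
    """Return ids that are connected now and were either unknown or
    disconnected in the previous snapshot."""
    needing = []
    for conn_id, conn in sorted(after.items()):
        if not conn.connected:
            continue
        previous = before.get(conn_id)
        if previous is None or not previous.connected:
            needing.append(conn_id)
    return needing


# After a replug the MST connector comes back under a new id (ids made up).
before = {77: Connector("DP-1-2", True)}
after = {91: Connector("DP-5", True)}
print(connectors_needing_modeset(before, after))
```

The point of the sketch is that a snapshot keyed by connector id treats the replugged monitor as a new connector, so a check that only looks at previously known ids would not request a modeset for it.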

Comment 26 Stanislav Lisovskiy 2018-11-19 16:31:43 UTC

Created attachment 142519
Userspace fix for Intel DDX (xf86-video-intel)

Comment 27 Stanislav Lisovskiy 2018-11-19 16:33:31 UTC

Created attachment 142520
Kernel fix

Kernel fix. It should work without userspace changes; however, I'm still not sure whether it is correct.

Comment 28 Benjamin Berg 2018-11-19 22:34:42 UTC

The patch sounds promising to me. I can have a look if I can reproduce the issue and make a test build to check whether it fixes the problem.

AFAIK, mutter (i.e. the GNOME wayland compositor) is not affected by this issue.

Comment 29 Benjamin Berg 2018-11-21 09:48:42 UTC

So, I did make a build for F29 with the patch:
  https://koji.fedoraproject.org/koji/taskinfo?taskID=31014179

Unfortunately, neither the user I had nor I are able to reproduce the issue properly, i.e. the monitor appears to come back correctly at least most of the time.

Comment 30 Stanislav Lisovskiy 2018-11-21 10:41:38 UTC

(In reply to Benjamin Berg from comment #29)
> So, I did make a build for F29 with the patch:
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=31014179
> 
> Unfortunately, the user I had and also myself are be unable to reproduce the
> issue properly. i.e. the monitor appears to come back correctly at least
> most of the times.

Do you mean that it comes back properly without the patch?
For faster reproduction, consider plugging in immediately after unplugging. For me it reproduces almost every second time with recent drm-tip.

Comment 31 Benjamin Berg 2018-11-21 11:32:37 UTC

(In reply to Stanislav Lisovskiy from comment #30)
> Do you mean, that it comes back properly without the patch?

Yeah. I had a user on F29 where the display on the dock could not be configured anymore (X11, Cinnamon, T480s, Thunderbolt dock, after being away for an hour). However, this happened exactly once. Since then neither the user nor I have been able to reproduce the issue using different setups.

Comment 32 Jani Saarinen 2018-12-18 12:26:16 UTC

*** Bug 109059 has been marked as a duplicate of this bug. ***

Comment 33 Jani Saarinen 2018-12-18 19:50:44 UTC

*** Bug 109059 has been marked as a duplicate of this bug. ***

Comment 34 Lakshmi 2019-02-07 09:14:26 UTC

(In reply to Benjamin Berg from comment #31)
> (In reply to Stanislav Lisovskiy from comment #30)
> > Do you mean, that it comes back properly without the patch?
> 
> Yeah. I had a user on F29 where the display on the dock could not be
> configured anymore (X11, cinamon, T480s, Thunderbolt dock, after being away
> for an hour). However, this only happened exactly once. Since then neither
> the user nor myself have been able to reproduce the issue using different
> setups.

Do you still have this issue with the latest drm-tip?

Comment 35 Stanislav Lisovskiy 2019-02-26 11:26:16 UTC

(In reply to Lakshmi from comment #34)
> (In reply to Benjamin Berg from comment #31)
> > (In reply to Stanislav Lisovskiy from comment #30)
> > > Do you mean, that it comes back properly without the patch?
> > 
> > Yeah. I had a user on F29 where the display on the dock could not be
> > configured anymore (X11, cinamon, T480s, Thunderbolt dock, after being away
> > for an hour). However, this only happened exactly once. Since then neither
> > the user nor myself have been able to reproduce the issue using different
> > setups.
> 
> Do you still have this issue with latest drmtip?

I checked it, and then checked again, and it seems that the issue is mostly on the userspace side. When this happens, the kernel detects and reports everything correctly. I checked that we get the corresponding uevent, and the GET_CONNECTOR ioctls also show correct output statuses. I traced the issue up to the point where the Intel DDX driver sends an update to the desktop manager, which however doesn't trigger a modeset. I think it most likely gets confused because the kernel allocates a new connector id each time a DP MST device is connected or disconnected; however, as I understand it, this is how it is currently supposed to work.

Comment 36 Lakshmi 2019-03-13 07:12:41 UTC

Patrik, the outcome of the investigation is that this issue is related to the GNOME desktop manager. Please report this issue here: https://gitlab.gnome.org/GNOME/mutter/issues

Closing this issue.