Bug 96569 - system freezes during boot, [drm:intel_dp_start_link_train [i915]] and [drm:intel_psr_work [i915]] ERROR
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Clinton Taylor
QA Contact: Intel GFX Bugs mailing list
Reported: 2016-06-17 22:05 UTC by Robin Krahl
Modified: 2016-12-15 07:53 UTC

i915 platform: HSW
i915 features: display/eDP, display/PSR


Attachments
output of `journalctl -k` after freeze with option drm.debug=0xe (125.61 KB, text/x-log)
2016-06-17 22:05 UTC, Robin Krahl
Workaround: don't retrain the link from long pulse (758 bytes, patch)
2016-07-13 15:10 UTC, Ander Conselvan de Oliveira
output of `journalctl -k` with patched kernel (88.31 KB, text/x-log)
2016-07-23 20:02 UTC, Robin Krahl

Description Robin Krahl 2016-06-17 22:05:04 UTC
Created attachment 124583 [details]
output of `journalctl -k` after freeze with option drm.debug=0xe

During or directly after boot, before the tty login appears, the system freezes. It can only be shut down using Magic SysRq. The system log contains these errors (full log attached):

[drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting
[drm:intel_psr_work [i915]] *ERROR* Timed out waiting for PSR Idle for re-enable

The second error repeats multiple times per second until the system is shut down.
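
To quantify how often each error appears in a saved kernel log, a small filter along these lines may help. This is only a sketch: the regular expression is inferred from the two messages quoted above, and `count_drm_errors` and the `sample` lines (including their `kernel:` prefix) are illustrative, not part of any existing tool or taken from the attached log.

```python
import re
from collections import Counter

# Matches lines such as:
#   [drm:intel_psr_work [i915]] *ERROR* Timed out waiting for PSR Idle for re-enable
# The captured group is the name of the function that logged the error.
ERROR_PATTERN = re.compile(r"\[drm:(\w+) \[i915\]\] \*ERROR\*")


def count_drm_errors(lines):
    """Count i915 *ERROR* lines, keyed by the reporting function name."""
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input built from the two messages quoted in this report.
sample = [
    "kernel: [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting",
    "kernel: [drm:intel_psr_work [i915]] *ERROR* Timed out waiting for PSR Idle for re-enable",
    "kernel: [drm:intel_psr_work [i915]] *ERROR* Timed out waiting for PSR Idle for re-enable",
    "kernel: some unrelated message",
]

for function, count in count_drm_errors(sample).most_common():
    print(function, count)
```

The same function could be fed the lines of a file that holds the output of `journalctl -k`, opened with Python's built-in `open()`.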

The problem can be reproduced using Linux v4.7-rc3, drm-intel-nightly (8bf2b76) and the current git/torvalds version (g9cbbef4).

Hardware: Toshiba Portege Z30-A notebook, no additional devices connected
Linux version: Arch
$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 0b)
Comment 1 Jani Nikula 2016-06-20 08:52:00 UTC
This is silly.

Based on the logs, I presume this happens before we enable eDP for the first time:

-> intel_dp_detect
-> intel_dp_long_pulse
-> intel_dp_check_link_status (since it's eDP, connector status remains connected)
-> apparently the crtc is active, but the channel eq is not okay, because well, we've never trained before, and then we go on trying to retrain
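
The condition described in the last step can be written down as a toy model. This is a simplified sketch of the reasoning above, not the i915 driver code: `LinkState` and `long_pulse_wants_retrain` are hypothetical names, and the two booleans stand in for driver state that is considerably more involved in reality.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    # Hypothetical stand-ins for the driver state mentioned above.
    crtc_active: bool     # the crtc driving the eDP panel is active
    channel_eq_ok: bool   # channel equalization reported as okay


def long_pulse_wants_retrain(state: LinkState) -> bool:
    """Model of the check: retrain when the crtc is active but channel EQ is not okay."""
    return state.crtc_active and not state.channel_eq_ok


# At boot, before the link has ever been trained, channel EQ cannot be okay,
# so the check asks for a retrain even though nothing is actually wrong.
at_boot = LinkState(crtc_active=True, channel_eq_ok=False)
print(long_pulse_wants_retrain(at_boot))  # True

# After a successful training the same check stays quiet.
trained = LinkState(crtc_active=True, channel_eq_ok=True)
print(long_pulse_wants_retrain(trained))  # False
```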

Ander, any ideas when we caused this to happen?
Comment 2 Ander Conselvan de Oliveira 2016-07-13 15:10:47 UTC
Created attachment 125056 [details] [review]
Workaround: don't retrain the link from long pulse

(In reply to Jani Nikula from comment #1)
> This is silly.
> 
> Based on the logs, I presume this happens before we enable eDP for the first
> time:
> 
> -> intel_dp_detect
> -> intel_dp_long_pulse
> -> intel_dp_check_link_status (since it's eDP, connector status remains
> connected)
> -> apparently the crtc is active, but the channel eq is not okay, because
> well, we've never trained before, and then we go on trying to retrain
> 
> Ander, any ideas when we caused this to happen?

Probably in the commit below. We had logic to retrain the link from long hpd, and that series changed intel_dp_detect() to share code with the long pulse handling. The problem is that the long pulse handling is also called from output polling during boot and resume from suspend, so the retrain logic ends up running there as well.

I'm not sure what the proper fix would be, but the attached patch would confirm that's the issue.

commit 7d23e3c37bb3fc6952dc84007ee60cb533fd2d5c
Author: Shubhangi Shrivastava <shubhangi.shrivastava@intel.com>
Date:   Wed Mar 30 18:05:23 2016 +0530

    drm/i915: Cleaning up intel_dp_hpd_pulse
Comment 3 Robin Krahl 2016-07-13 21:13:09 UTC
(In reply to Ander Conselvan de Oliveira from comment #2)
> Created attachment 125056 [details] [review]
> Workaround: don't retrain the link from long pulse
> 
> […]
> 
> I'm not sure what the proper fix would be, but the attached patch would
> confirm that's the issue.

Thanks!  Unfortunately the patch (applied against v4.7-rc1) does not solve the problem in the original setup.

But I found out by accident that apparently the system does not actually freeze, it’s just the internal monitor that is not updated.  If I connect an external monitor, I can log in and work (though the tty is flooded with the timeout errors).

Should I provide logs from the patched kernel, or test whether the external monitor also works with the unpatched kernel?
Comment 4 Ander Conselvan de Oliveira 2016-07-14 08:56:11 UTC
(In reply to Robin Krahl from comment #3)
> (In reply to Ander Conselvan de Oliveira from comment #2)
> > Created attachment 125056 [details] [review]
> > Workaround: don't retrain the link from long pulse
> > 
> > […]
> > 
> > I'm not sure what the proper fix would be, but the attached patch would
> > confirm that's the issue.
> 
> Thanks!  Unfortunately the patch (applied against v4.7-rc1) does not solve
> the problem in the original setup.
> 
> But I found out by accident that apparently the system does not actually
> freeze, it’s just the internal monitor that is not updated.  If I connect an
> external monitor, I can login and work (though the tty is flooded with the
> time out errors).
> 
> Should I provide logs from the patched kernel, or test whether the external
> monitor also works with the unpatched kernel?

Please provide the logs for the patched kernel. There were two different error messages in the log, and the patch would fix the first one: "*ERROR* failed to train DP, aborting". This may or may not be related to the second error message and the frozen screen.
Comment 5 Robin Krahl 2016-07-23 20:02:11 UTC
Created attachment 125282 [details]
output of `journalctl -k` with patched kernel

Okay, I added the log of the patched kernel (4.7-rc1).
Comment 6 Ander Conselvan de Oliveira 2016-07-25 12:22:07 UTC
(In reply to Robin Krahl from comment #5)
> Created attachment 125282 [details]
> output of `journalctl -k` with patched kernel
> 
> Okay, I added the log of the patched kernel (4.7-rc1).

So, it does seem like the two issues are not related. The link training errors went away.

Rodrigo, do you know what could cause the PSR wait-for-idle timeout?
Comment 7 Erik Ekman 2016-08-31 16:39:00 UTC
I had this problem as well, and bisected it to the following kernel commit: 03b7b5f983091bca1, drm/i915/psr: Try to program link training times correctly

which was done as a fix to https://bugs.freedesktop.org/show_bug.cgi?id=95176

I have a Toshiba Portege Z30-A-15M with an i7-4500U, and in the linked bug another Toshiba user also had a problem. I have now tested with an external HDMI screen and can confirm it is just the laptop display not redrawing, not a complete hang.
Comment 8 Erik Ekman 2016-09-28 16:32:32 UTC
Hi again

Is there anything we can help with on this bug? Seems to be Toshiba specific.
Comment 9 flying-sheep 2016-12-12 09:34:44 UTC
I think I have the same issue. I have posted some info here: https://bbs.archlinux.org/viewtopic.php?pid=1648897
Comment 10 flying-sheep 2016-12-12 09:39:11 UTC
for me the external display also hangs, and the bug only occurs when using a linux kernel version >4.5.4

the newest versions i tried it with are

linux 4.8.13
xf86-video-intel 1:2.99.917+746+g169c74f-1

using i915.enable_psr=0 doesn’t help.
Comment 11 Paulo Zanoni 2016-12-14 20:37:37 UTC
(In reply to flying-sheep from comment #10)
> for me the external display also hangs, and the bug only occurs when using a
> linux kernel version >4.5.4
> 
> the newest versions i tried it with are
> 
> linux 4.8.13
> xf86-video-intel 1:2.99.917+746+g169c74f-1
> 
> using i915.enable_psr=0 doesn’t help.

If that's the case, please open a new bug report. Boot the latest kernel from drm-tip (https://cgit.freedesktop.org/drm-tip) with drm.debug=0xe log_buf_len=1M, then attach the dmesg output to the bug report. PSR was just disabled by default on drm-tip, so any error messages related to PSR should just go away now.
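
Before collecting the log, it may be worth checking that the running kernel was really booted with both parameters. The following is a minimal sketch under some assumptions: `parse_cmdline` and `missing_debug_params` are illustrative helpers, the example command lines are made up, and the simple whitespace split does not handle quoted values that contain spaces.

```python
# Parameters requested in the comment above.
REQUESTED = {"drm.debug": "0xe", "log_buf_len": "1M"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' map to None. Quoted values with spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_debug_params(cmdline: str) -> list:
    """Return the requested parameters that are absent or set to a different value."""
    params = parse_cmdline(cmdline)
    return sorted(name for name, value in REQUESTED.items() if params.get(name) != value)


# Made-up command lines for illustration only.
print(missing_debug_params("root=/dev/sda1 rw drm.debug=0xe"))
print(missing_debug_params("root=/dev/sda1 rw drm.debug=0xe log_buf_len=1M"))
```

On a running system the string to check can be read from `/proc/cmdline`; an empty result means both parameters are set as requested.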
Comment 12 Paulo Zanoni 2016-12-14 20:38:41 UTC
We just merged a patch to disable PSR by default:

commit 2ee7dc497e348eecbb82adbb1ea9e9a7e29fe921
    drm/i915: disable PSR by default on HSW/BDW

This commit is marked for inclusion in the stable kernels, so it should reach your Linux distribution at some point soon.

Thank you for your bug report. In case you think the problem still happens, please feel free to reopen the bug. Please also make sure to re-generate the log files with the latest drm-tip tree and attach them here, since at least the PSR-related error messages should be gone now.

