105811 – kernel 4.15.x boot fails at kms

Bug 105811 - kernel 4.15.x boot fails at kms

Summary: kernel 4.15.x boot fails at kms

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-03-29 17:26 UTC by Alec Habig
Modified:	2019-07-09 13:51 UTC (History)
CC List:	3 users (show)

See Also:
i915 platform:	IVB
i915 features:	display/atomic

Attachments
A photo of some of the kernel spew. (4.97 MB, image/jpeg) 2018-03-29 17:26 UTC, Alec Habig	no flags	Details
dmesg from initial boot into single user mode (96.28 KB, text/plain) 2018-03-30 14:24 UTC, Alec Habig	no flags	Details
dmesg from boot into runlevel 3 after ntfs filesystems removed (94.05 KB, text/plain) 2018-03-30 14:24 UTC, Alec Habig	no flags	Details
dmesg including logging after X started (161.43 KB, text/plain) 2018-03-30 14:25 UTC, Alec Habig	no flags	Details
dmesg from boot with nouveau instead of proprietary driver. (88.15 KB, text/x-log) 2018-04-25 17:49 UTC, Alec Habig	no flags	Details
dmesg from drm.debug=14 clean boot (which then hung after this log was copied...) (127.21 KB, text/x-log) 2018-04-26 20:16 UTC, Alec Habig	no flags	Details
/var/log/messages from that corresponding session, including an unfortunately sparse record of the actual hang. Last line in the dmesg corresponds to 14:44:44 in this full log. (179.56 KB, text/plain) 2018-04-26 20:17 UTC, Alec Habig	no flags	Details
full journalctl -b -1 -l dump obtained after being able to issue a clean reboot via sssh. (300.57 KB, text/x-log) 2018-04-27 14:08 UTC, Alec Habig	no flags	Details
dmesg from a successful boot with drm-tip 4.17.0-rc4, and a simpler .config. (195.06 KB, text/plain) 2018-05-10 15:56 UTC, Alec Habig	no flags	Details
View All

Description Alec Habig 2018-03-29 17:26:07 UTC

Created attachment 138431 [details]
A photo of some of the kernel spew.

On my Fedora 27 Thinkpad T430, I have been unable to boot any of the 4.15.x kernels (4.14.18 and previous work just fine).  The kernel crashes at the initial kms only 3s into the boot, before any logging to disk happens, making this one hard to provide information on.  Adding "nomodeset" to the boot lets the boot process go further: but as soon as a buffer flush tries to write to the system disk, it hangs (so again, no syslog info beyond a string of ^@'s).  kdump doesn't help: that needs to write to disk, and this bug seems to be taking out that ability as it flails.

Complicating matters further is that the screen's backlight is turned off at the start of the kms (in a normal boot it comes back on later), rendering console messages hard to read.  Luckily, I found out that with an external monitor, it will eventually (a minute or two later) light up, resulting in the attached screenshot which points to the i915 driver.

The initial bug was posted at Fedora's bugzilla:  

  https://bugzilla.redhat.com/show_bug.cgi?id=1557416

and they redirected me here (delayed by travel, sorry).

I see this Ubuntu bug, which shares a similar backtrace:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1751268

although my crash is not dependent on an external monitor.  

Another similar backtrace is on your site here:

  https://bugs.freedesktop.org/show_bug.cgi?id=104573

but that one doesn't seem fatal.

Ideas?  bios update to the laptop didn't change anything.  xorg update didn't change anything.

F27 is now on xorg-x11-drv-intel-2.99.917-31-20171025.  All 4.15.x kernels they've put out do the same thing, the latest being 4.15.12-301

Thanks in advance - this has been bugging me for months, and all my usual debugging techniques have been stymied!

Comment 1 Jani Saarinen 2018-03-30 12:27:18 UTC

HI, can you try with latest drm-tip (https://cgit.freedesktop.org/drm-tip and please add drm.debug=14 module parameter, and attach dmesg from boot to reproducing the problem. Thanks.

Comment 2 Alec Habig 2018-03-30 13:26:47 UTC

Will try.  Am not optimistic we'll get much, however, as the crash occurs before syslogd can write anything.

Comment 3 Alec Habig 2018-03-30 14:24:13 UTC

Created attachment 138444 [details]
dmesg from initial boot into single user mode

Comment 4 Alec Habig 2018-03-30 14:24:57 UTC

Created attachment 138445 [details]
dmesg from boot into runlevel 3 after ntfs filesystems removed

Comment 5 Alec Habig 2018-03-30 14:25:57 UTC

Created attachment 138446 [details]
dmesg including logging after X started

same boot as dmesg-booted.tx, but with some additional logging after X was started.

Comment 6 Alec Habig 2018-03-30 14:30:31 UTC

ok, boot success!  First try fell into single user mode due to ntfs filesystems the drm-tip kernel didn't like.  (the first dmesg attachement).

Commenting those out from fstab got me into runlevel three (the second dmesg attachment).  X started just fine (I command line that using startx), additional debugging logs are appended (the 3rd dmesg file).

Note that it didn't crash, so these logs, as happy as I am to get them, don't illuminate the problem (although they surely have lots of info about the intel chip in the system with the problem!).  Potential reasons include:  Maybe 4.16 fixed the problem; or maybe the drm-tip kernel doesn't include some offending but of code running in the Fedora kernel (eg, it obviously doesn't have ntfs support enabled).

Comment 7 Alec Habig 2018-03-30 15:52:01 UTC

Narrowed down the possibilities: just fished the 4.16.0-rc7 rpms for f28 out of koji: same crash, same lack of any entries into /var/log/messages whilst doing so.  So, the problem isn't fixed in the newer kernel: it them must be something that's in the Fedora kernel and not in the drm-tip kernel.

Ideas for how to narrow this down?

Comment 8 Jani Saarinen 2018-04-25 11:46:09 UTC

Jani, any advice here?

Comment 9 Jani Nikula 2018-04-25 11:54:52 UTC

[    3.562059] bbswitch: loading out-of-tree module taints kernel.
[    3.562266] bbswitch: version 0.8
[    3.562272] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[    3.562278] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG_.VID_
[    3.562286] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20180105/nsarguments-100)
[    3.562634] bbswitch: detected an Optimus _DSM function
[    3.562644] pci 0000:01:00.0: enabling device (0000 -> 0003)
[    3.562676] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on

Can you disable the discrete card and run without out-of-tree stuff?

Comment 10 Alec Habig 2018-04-25 17:48:28 UTC

Good idea Jani!

Removed bumblebee and the proprietary nvidia driver.  This didn't fix things, but it did allow me to go further before things locked up.  Including actual system logging!

Attached a dmesg output I got before things fell over.  Upon booting, it seems the nouveau driver kms fails, then it falls over to i915.  You can see nouveau initializing around 2.2 seconds in the log, then timing out at around 16 seconds.  The first actual kernel crashes come later, starting with ieee80211 complaints.  I was able to dmesg to a logfile and scp it away before things really unraveled.  Hitting ctl-alt-del (all on console) induces a hard hang.

blacklisting nouveau and rebuilding the initfs, it's back to the original problem of not even booting happily.  Disabling the discrete card in bios does not change this behavior.  The tail of its problems again shows up eventually on my second monitor, looking similar to the initial attachments where drm_atomic_helper is timing out.  Wait... as I type this it eventually got to single user mode, second dmesg dump retrieved before it hang (again on ctl-alt-del).  But, dumping that to a text file didn't make it to disk, the crash didn't flush the disk write buffer :(

Comment 11 Alec Habig 2018-04-25 17:49:21 UTC

Created attachment 139108 [details]
dmesg from boot with nouveau instead of proprietary driver.

Comment 12 Alec Habig 2018-04-25 18:07:44 UTC

Slight correction - my first attempt to boot with discrete graphics enabled was foiled by there being a second bios setting that tries to autoset the graphics cards, despite me having just manually set it.  REALLY forcing discrete graphics off got me back to the exact same "hang, doesn't write to syslog, but I could take a photo of it" problem as started this thread.  Same kernel dump about ironlake and drm.

Comment 13 Alec Habig 2018-04-25 18:11:24 UTC

... and, to close the loop: 4.14.18 continues to work great (well, without bumblebee now, because it's uninstalled), as does windows 7 (dual-booted).

Comment 14 Jani Nikula 2018-04-26 07:07:40 UTC

(In reply to Alec Habig from comment #11)
> Created attachment 139108 [details]
> dmesg from boot with nouveau instead of proprietary driver.

Please add drm.debug=14 module parameter to get detailed debugging, and get the dmesg again. The attached dmesg doesn't seem to have anything out of the ordinary in graphics, the only warns come from wifi.

Comment 15 Alec Habig 2018-04-26 20:15:44 UTC

ok.  First, a note that yesterday and today's test were with kernel-4.15.17-300.fc27.x86_64

Couldn't get past a few seconds into the boot (the original failure mode) till I remembered that I'd taken the nouveau modules out of the initfs yesterday.  Put it back, booted with drm.debug-14.

It certainly seems that without nouveau taking the hit for whatever goes wrong in the initial kms, things are toast.  nouveau times out, hands off to i915, which finishes the kms.  However, if there's no nouveau in the kernel, i915 fails the initial mode swich entirely.

Anyway, this boot went clear to runlevel 3 with no errors this time, so I logged on console as root, saved a dmesg (attached).  Then exit'd from root's bash.... and things hung.  A bit later these errors came to console:

pr 26 14:47:24 entropy kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:37:pipe A] flip_done timed out
Apr 26 14:47:24 entropy kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:pipe B] flip_done timed out

after which I leaned on the power switch to shut down.

Since it didn't go down cleanly, "journalctl -b -1" can't count its instances correctly.  But, things did go to /var/log/messages this time, so the complete /var/log/messages for that session is also attached.  To sync the two different time series, the dmesg ends around 14:44:44 in messages.

Comment 16 Alec Habig 2018-04-26 20:16:23 UTC

Created attachment 139146 [details]
dmesg from drm.debug=14 clean boot (which then hung after this log was copied...)

Comment 17 Alec Habig 2018-04-26 20:17:21 UTC

Created attachment 139147 [details]
/var/log/messages from that corresponding session, including an unfortunately sparse record of the actual hang.  Last line in the dmesg corresponds to 14:44:44 in this full log.

Comment 18 Jani Saarinen 2018-04-27 06:22:51 UTC

There are similar issues seen on many bugs eg. on IVB: https://bugs.freedesktop.org/show_bug.cgi?id=104573.
Ville, any idea for those?

Comment 19 Jani Nikula 2018-04-27 09:30:13 UTC

(In reply to Jani Saarinen from comment #18)
> There are similar issues seen on many bugs eg. on IVB:
> https://bugs.freedesktop.org/show_bug.cgi?id=104573.
> Ville, any idea for those?

Nothing similar about that AFAICT.

Comment 20 Jani Nikula 2018-04-27 09:32:28 UTC

I also don't see an actual crash or hang here anyway. Sure, the display may be toast, but per the logs systemd gracefully shuts down on power button press. Can you ssh into the machine?

Comment 21 Jani Nikula 2018-04-27 09:39:12 UTC

Finally, the Optimus stuff have always been a pain. Life suddenly feels too short when debugging issues with them. I know it sucks for you as a user, but please understand that they're really hard for us too. We only have control over one piece of the puzzle, without much clues of the whole.

Comment 22 Alec Habig 2018-04-27 14:07:57 UTC

sshing in let me continue to work on shell in the ssh session, after the console hung upon exiting.  dmesg reports continued "flip timeout" errors as kms flails.

"shutdown -r now" at this point, although booting the ssh session promptly as expected, takes a few minutes to complete, seems to be waiting for the kms errors to complete.  This clean boot did allow me to get a journalctl log of the whole process (will attach that).

I understand that optimus is a nasty mess.  But, frustratingly, things work just fine with the fedora 4.14.* series, and just fine with your debugging tree kernel.  As a user I suppose I can always go back to the bad old days of hand installing kernels rather than just loading them from the yum repository.  Last time I did that regularly was going on 20 years ago :)

The fact that the errors happen with either fedora's 4.15.* tree or their 4.16 devel tree is telling us that something in the deltas they apply to their kernel build is the source of the problem.  I suppose at this point, I go back to the original fedora bugzilla and put the ball back in their court?  bisecting their configs and patches sounds like the next (tedious) step.

Before I go, though: what are these "flip timeout" errors?  If I had to guess, any number of things horribly wrong in there could make the i915 module get into a funny state that eventually throws this error: but any insight as I move forward would be appreciated.

Comment 23 Alec Habig 2018-04-27 14:08:40 UTC

Created attachment 139175 [details]
full journalctl -b -1 -l dump obtained after being able to issue a clean reboot via sssh.

Comment 24 Alec Habig 2018-04-27 14:53:09 UTC

I just asked the Fedora bugzilla about what patches they're putting on to their rpms, and they suggested:

  It's probably the opposite, there is something in the Free Desktop tree that
  hasn't yet made it into Linus' tree. If you can narrow down which patch it is
  we can bring it into Fedora.

Comment 25 Jani Saarinen 2018-05-04 12:26:58 UTC

Alec, any updates from you on this?

Comment 26 Alec Habig 2018-05-04 13:08:05 UTC

No update since the Fedora suggestion above.

How different is your development tree from the main kernel release?

Comment 27 Jani Saarinen 2018-05-04 13:50:00 UTC

Jani, like to comment or add something to https://01.org/linuxgraphics/gfx-docs/maintainer-tools/drm-intel.html?

Comment 28 Alec Habig 2018-05-10 15:55:33 UTC

Interesting development here.  I decided to try the latest drm-tip (4.17.0-rc4) and see what happens.  It died: but looking at it harder, it seems my working drm-tip (4.16.0-rc7) had a much simpler .config file.  I think the difference is I compiled the new one as root, which picked up all the config settings from my currently running fedora 4.14 kernel, while doing it as a normal user for 4.16 did not.

Anyway - copying over the 4.16 .config into the 4.17 build area, and being conservative with all the new config optins the 4.17 build presents, it does boot again: although it does spend 30+seconds at the boot flailing: before succeeding rather than failing.

So: something in fedora's kernel config is triggering my problem.  So I can try bisecting kernel options till I find the specific troublemaker.  There are a lot of them though.

But: it's not just that.  Hopefully the additional debugging info in the 4.17 dmesg (attached) offer clues.  You can see the drm module having problems: I do not know the severity of the problems, just that it did the same "spend 30-odd seconds early in the boot timing out" problem as it does when it fails,even though it then succeeds (with a simpler .config).

Comment 29 Alec Habig 2018-05-10 15:56:45 UTC

Created attachment 139469 [details]
dmesg from a successful boot with drm-tip 4.17.0-rc4, and a simpler .config.

Comment 30 Jani Nikula 2018-05-11 08:36:40 UTC

Per logs you have selftests enabled:

[    0.377402] drm_mm: Testing DRM range manger (struct drm_mm), with random_seed=0x9292e37 max_iterations=8192 max_prime=128
[    0.377756] drm_mm: igt_sanitycheck - ok!
[    0.377962] igt_debug 0x0000000000000000-0x0000000000000200: 512: free
[    0.377963] igt_debug 0x0000000000000200-0x0000000000000600: 1024: used
[    0.377963] igt_debug 0x0000000000000600-0x0000000000000a00: 1024: free
[    0.377964] igt_debug 0x0000000000000a00-0x0000000000000e00: 1024: used
[    0.377965] igt_debug 0x0000000000000e00-0x0000000000001000: 512: free
[    0.377966] igt_debug total: 4096, used 2048 free 2048

Use

CONFIG_DRM_DEBUG_SELFTEST=n
CONFIG_DRM_I915_DEBUG=n

Comment 31 Alec Habig 2018-05-14 20:12:19 UTC

Yes, that self-test thing was the delay early in boot with the simply-configured drm-tip kernel.  Thanks!

Working on testing various kernel configs to see which combination is the lethal one.  This will take a while...  (although less now that it boots faster :)

Comment 32 btmckee9 2018-08-21 04:51:38 UTC

I don't know if this will help, but I have an old Centrino i5 mobile running 4.18.1 as a media pc and if I connect an external monitor and boot I get a hard crash of the PC. I managed to get a complete dmesg just before the hang.

This PC is intel only, not dual video.

https://pastebin.com/raw/g5n2J90k

I'm using this PC at the moment, on the external monitor which works fine if I plug it in after X has started.

I'll try to keep tabs on this thread.

This is a Gentoo install.

Comment 33 Lakshmi 2018-09-12 08:34:45 UTC

Alec, do you still have the boot failure?

Comment 34 Alec Habig 2018-09-12 13:50:22 UTC

I do: even with the current Fedora Kernel (4.17.19) and the latest Lenovo BIOS (which got updated recently, presumably for spectre mitigation).  Just verified this when booting this morning (accidentally chose the wrong kernel in grub).

I have made little progress on narrowing down which kernel features cause the problem when enabled.  Your vanilla testing kernel works, the same version of the fedora configured kernel does not.  But there's a lot of parameter space to search, each "tweak/compile/boot/fail" iteration takes an hour, and this is the laptop I use for actual work.  So slow progress :(

Comment 35 Alec Habig 2018-09-21 15:33:23 UTC

So since last week, kernel-4.18.7-100.fc27.x86_64 came out on the Fedora updates repo.  Tried it just now in the process of looking at the next batch of options differences... and it didn't crash!  And all the other drivers I need on this laptop are enabled (which wasn't the case with your drm-tip kernel).

Typing at you now in an X session while running the new kernel and its video modules.

Sometimes bigger numbers really are better :)

Will go read release notes and changelogs to see if there are clues to what might have changed (for the good this time), and update this bug for future reference if there's news.

Comment 36 Alec Habig 2018-09-21 18:41:21 UTC

ok, all's not wine and roses.  When exiting X, it goes into the same style lockup as used to happen on boot.  Which includes crashing so hard that often times (but not all the time) the BIOS can't find the boot disk after the next power cycle.

At least the bug moving further back in the chain means I can do some work, and investigate a running system.  Thank goodness I didn't uninstall the (still happily working) kernel-4.14.18-300.fc27.x86_64

Comment 37 Lakshmi 2018-10-21 18:43:07 UTC

(In reply to Alec Habig from comment #36)
> ok, all's not wine and roses.  When exiting X, it goes into the same style
> lockup as used to happen on boot.  Which includes crashing so hard that
> often times (but not all the time) the BIOS can't find the boot disk after
> the next power cycle.
> 
> At least the bug moving further back in the chain means I can do some work,
> and investigate a running system.  Thank goodness I didn't uninstall the
> (still happily working) kernel-4.14.18-300.fc27.x86_64

Alec, sorry for the delay...
Do you have the dmesg from boot? Attached logs are old. Have you verified wtih latest drm-tip? Please attach the latest dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.

Comment 38 Alec Habig 2019-01-17 02:55:09 UTC

Sorry for the long silence. Going has been slow: it takes a while to do a tweak/test/reboot/hang cycle. Especially since the disk controller gets munged upon the bug happening, even persisting across a power cycle (!) about 50% of the time (I have no idea how such a thing is even possible). And, this device is my "get work done" laptop, using the 4.14 kernel which is perfectly happy.

Anyway: tried a different approach. Threw in a spare disk, and tried a clean install of F29 and Kubuntu 18.04.1, in case cruft was to blame.

F29 installed from dvd... but the kernel installed to disk didn't boot once the first install pass finished.
Kubuntu's install dvd wouldn't even boot past the spinny "it's starting to boot" graphics screen.

In both cases, since things are "graphical boot" and "quiet", there's no feedback as to exactly what went wrong, and the system hangs too hard to switch to a different vtty, but it felt the same as the hangs described above.

Trying with the latest/last F27 kernel (yes, it's EOL now) 4.18.19-100, tried all permutations of integrated, discrete, and optimus bios settings, excluding nouveau drivers, and the nvidia proprietary blob. Sometimes it can get a clean boot - but then the sata controller goes out to lunch as soon as a write cache flush happens. Which makes me think the kms problem which started this thread is a symptom rather than the problem, just the one which usually triggers first. This is consistent with the problems getting log traces of the problems described above, because a lunched sata controller can't log errors.

Went back to drm-tip. See that it's kernel 5.0 now, cool.

This minimally configured kernel continues to work. I've enabled the extra features needed to run the laptop, no problems.

So: the bug is at the very least triggered by one of the (myriad) of enabled kernel options in the distro stock kernels. It feels to me like the old days of unprotected flat memory space where you could POKE random values into random addresses and watch the system fall apart: with the initial kms call being the most sensitive to it, and things unravel into thrashing the disk controller.

Parameter space is too vast for me to find the culprit with intermittent effort and a logging system that's often the first victim of the bug. So, I'm ready to punt, documenting this here in case someone else with more clues googles it is the only remaining thing I can do.

Time to just return to the 1990's and compile my own kernel :( At least git now makes tracking updates easier than it used to be in the Bad Old Days.

Comment 39 Alec Habig 2019-03-10 14:51:55 UTC

Just accidentally booted into fedora's 4.20.13-200.fc29.x86_64  (have been happily running the drm-tip)... and it worked!

Will try reading changelogs to get some clues.

Comment 40 Alec Habig 2019-04-04 18:03:36 UTC

nuts.  Going to 4.20.15-200.fc29.x86_64 failed again... and backing up to 4.20.13-200 also failed.  Tricked by a race condition into thinking that that version actually worked?  Bad interaction with some other update applied since?  Dunno.

drm-tip (now at 5.1-rc3) continues to work.

Fedora 29 is now at 5.0.5-200... yep, same problems as always.

Nothing in the changelogs lights me up as an obvious thing, but there's a lot there and I'm not a kernel expert.

At least I can compile my own from drm-tip's clean tree and get work done.

Comment 41 Jani Nikula 2019-04-18 08:52:20 UTC

drm-tip works, nothing in dmesg to indicate failure. Pretty much the only way to get progress is to get a bisect result.

Comment 42 Lakshmi 2019-06-17 07:06:22 UTC

Alec, any updates here? Did you get any chance to bisect the results?
Priority is updated as the issue isn't in drmtip.

Comment 43 Alec Habig 2019-06-17 14:22:29 UTC

Bisecting is a manual process, since it's some option (or combination thereof) enabled in Fedora's .config that's the problem rather than a bad commit which git can help with.

You're right in that it's not drm-tip itself.  In fact, while the original weirdness manifested itself in drm, it also trashed the disk controller.  In the last couple stock fedora kernels (5.1.6 and 5.1.8), the problem has shifted to the keyboard controller (or maye it's usb, haven't disentabled it).  So something in there is stomping on things it shouldn't, and for the longest time, that was the video driver.

Since I'm now pretty certain it's not your bug, closing this one is fine: I'll keep poking at it till I solve it, or until I finally upgrade this T430, whichever comes first :)

Comment 44 Lakshmi 2019-07-09 13:51:03 UTC

Closing this issue as NOTOUTBUG.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.