Bug 101368 - Nouveau regression GT218M in Kernel 4.11 Won't Boot
Summary: Nouveau regression GT218M in Kernel 4.11 Won't Boot
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-06-10 04:40 UTC by Ben Steel
Modified: 2017-06-26 11:22 UTC

Attachments
dmesg output from the earlier 4.10 kernel showing correct functioning (66.62 KB, text/plain)
2017-06-10 04:40 UTC, Ben Steel
dmesg after rmmod and insmod of nouveau (66.66 KB, text/plain)
2017-06-11 19:33 UTC, Ben Steel
dmesg with debug=trace (76.63 KB, text/plain)
2017-06-13 20:41 UTC, Ben Steel

Description Ben Steel 2017-06-10 04:40:35 UTC
Created attachment 131838
dmesg output from the earlier 4.10 kernel showing correct functioning

Kernel 4.11.3-1 on a Lenovo Thinkpad T510 requires nomodeset to boot. Kernel 4.10.13-1 was fine. The T510 uses an NVIDIA Quadro NVS 3100M with 512MB. The OS is OpenSUSE Tumbleweed (a semi-tested rolling release for the brave). A dmesg from the working 4.10 kernel is attached.
Downstream bug report is bugzilla.opensuse.org #1043280.
Comment 1 Ilia Mirkin 2017-06-10 11:58:55 UTC
Any particular reason you're blaming nouveau?

From the 4.10 log, it appears that i915 is the primary drm driver.

What does "won't boot" mean?

What happens if you specify 'nouveau.modeset=0' (which has the effect of disabling nouveau entirely but leaving i915 as it was)?
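
To confirm which value of the parameter a given boot actually used, something like the following may help. It is a minimal sketch: the helper name cmdline_value is made up for illustration, and it only handles simple key=value words without quoting.

```shell
#!/bin/sh
# Print the value of a key=value parameter found in a kernel command line
# string. Prints nothing and returns 1 if the parameter is absent.
# cmdline_value is a hypothetical helper, not part of any existing tool.
cmdline_value() {
    line=$1
    key=$2
    for word in $line; do
        case $word in
            "$key"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# On a running Linux system, inspect the command line of the current boot.
if [ -r /proc/cmdline ]; then
    cmdline_value "$(cat /proc/cmdline)" nouveau.modeset \
        || echo "nouveau.modeset not set on this boot"
fi
```

With nouveau.modeset=0 on the command line this prints 0; without it, the fallback message is printed.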
Comment 2 Ben Steel 2017-06-10 18:03:26 UTC
Thanks for the prompt response.

nouveau.modeset=0 allows it to successfully boot all the way into the GUI.

In successful 4.10 kernel boots, nouveau console messages are present. In unsuccessful 4.11 boots with console logging, the console messages would stop and the screen would experience bad scrolling (redraw problems) at unpredictable points in the list but never mentioning nouveau. I know that it didn't complete booting without the screen because the ssh daemon never came up, dashing my hopes for better logging.
Comment 3 Ilia Mirkin 2017-06-10 19:55:15 UTC
OK, so after booting with nouveau.modeset=0, rmmod nouveau, and reinsert it without that option. As your system should be fully up, you should get the relevant info of what fails.
Comment 4 Ben Steel 2017-06-11 19:33:24 UTC
Created attachment 131881
dmesg after rmmod and insmod of nouveau

Tried the command 
  sudo rmmod -v nouveau 
followed by
  sudo insmod /lib/modules/4.11.3-1-default/kernel/drivers/gpu/drm/nouveau/nouveau.ko 

X was not running and I was on console. In each case stdout and stderr were redirected, but results were zero-length, so omitted for clarity.
Results of insmod:
Visibly, the four dmesg items from 1763.415095 to 1763.417364 were shown, the display froze up and the fan turned on, blowing very warm air. SSH was very slow, taking around a minute to get a prompt. Each command typed could take a similar amount of time. A shutdown command was never able to complete. Dmesg output during the ssh session after insmod is attached. Hope this helps.
Comment 5 Ilia Mirkin 2017-06-11 23:08:43 UTC
Can you see if this patch helps any?

https://github.com/skeggsb/linux/commit/a7cb78bab3671dbad08e5b2f5fd83a6dbda90fe5

If not, please boot with nouveau.debug=trace as well.
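
If the machine cannot boot with nouveau enabled, the same trace can probably be collected by reloading the module on a system booted with nouveau.modeset=0. A rough sketch, assuming nouveau is not in use at the time (no X, console not bound to it); the RUN variable and the function name are made up for illustration, and by default the commands are only printed:

```shell
#!/bin/sh
# Reload nouveau with trace-level logging.
# RUN is a hypothetical switch: it defaults to "echo", so the commands are
# only printed. Set RUN=sudo to actually execute them.
RUN=${RUN:-echo}

reload_nouveau_trace() {
    $RUN rmmod nouveau &&
    $RUN modprobe nouveau debug=trace
}

reload_nouveau_trace

# Afterwards, save the kernel log, ideally from an ssh session in case the
# local display freezes:
#   dmesg > dmesg-trace.txt
```

Run it as RUN=sudo sh script.sh to perform the reload; the file name dmesg-trace.txt is only an example.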
Comment 6 Ben Steel 2017-06-13 20:41:13 UTC
Created attachment 131935
dmesg with debug=trace

Thanks for your efforts. Boot with patched driver unsuccessful. Dmesg from rmmod and insmod with debug trace attached. Thanks also to Takashi Iwai for the build of the patched kernel.
Comment 7 Ilia Mirkin 2017-06-13 21:07:01 UTC
Interesting. Dies in PMU preinit somewhere, which in turn calls nvkm_pmu_reset. I don't see an obvious reason for that to die except ... if PTIMER is somehow off?

The only difference is from 1e2115d8c0c0da62405400316f5499d909e479bc which makes it so that nvkm_falcon_v1_new is now being called, although I can't imagine what would go wrong there.

This will require someone who knows what they're doing to figure out ... i.e. not me. Hopefully Ben can take a look.
Comment 8 Ben Skeggs 2017-06-14 00:31:08 UTC
I investigated an issue a while back that I believe is likely the same as what you're experiencing.  I'm, unfortunately, not able to reproduce properly (long story) on any hardware I own.

I've identified a few issues that could result in what you're seeing, however, I have no idea why they've suddenly become a problem in 4.11.  There doesn't seem to be an obvious commit that's the culprit here, so I can't be sure any of the issues I found will actually resolve the problem.

If you would be able to bisect between 4.10 and 4.11, and determine the exact commit where this starts happening, that'd be a big help.  It's potentially not even a nouveau commit that's triggered this.
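
For anyone attempting this, a bisection between the two releases might look roughly like the sketch below. It assumes a kernel git tree that has the v4.10 and v4.11 tags; the helper bisect_steps is made up for illustration and only estimates how many build-and-boot cycles to expect (about log2 of the commit count).

```shell
#!/bin/sh
# Estimate the number of bisection steps for N commits: ceil(log2(N)).
# bisect_steps is a hypothetical helper, not a git command.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Only does anything inside a kernel git tree that has both tags.
if git rev-parse -q --verify v4.10 >/dev/null 2>&1 &&
   git rev-parse -q --verify v4.11 >/dev/null 2>&1; then
    count=$(git rev-list --count v4.10..v4.11)
    echo "commits: $count, roughly $(bisect_steps "$count") build-and-boot cycles"
    # Then, by hand:
    #   git bisect start v4.11 v4.10   # bad revision first, then good
    #   ...build, install, boot, check whether nouveau hangs...
    #   git bisect good                # or: git bisect bad
    #   git bisect reset               # when finished
fi
```

Since the trigger may not be a nouveau commit, the sketch does not restrict the bisection to drivers/gpu/drm/nouveau.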

Thanks,
Ben.
Comment 9 Manuel Coenen 2017-06-20 13:17:05 UTC
I have the same issue on an Asus PL80Jt, which has the same Nvidia chip built in. Using modprobe.blacklist=nouveau lets me boot just fine (except that PRIME won't be enabled and the card will not be disabled by vgaswitcheroo). Otherwise I experience what has been described here.

I compiled kernel 4.12-rc6 this morning and it boots fine again. So this will be fixed in 4.12. My machine is too slow to do a bisect (a kernel compile takes several hours), so I cannot tell you what broke it originally.

Best regards
Manuel
Comment 10 Ben Steel 2017-06-21 16:25:28 UTC
Thank you Mr. Coenen. I agree. Kernel 4.12.0-rc6 does not exhibit the problem on a Lenovo T510 either.
Between a fix already in the pipeline and a modeset workaround in the meantime, I'm happy. Does anyone still need the bisection completed?
Comment 11 Ben Skeggs 2017-06-23 01:11:20 UTC
(In reply to Ben Steel from comment #10)
> Thank you Mr. Coenen. I agree. Kernel 4.12.0-rc6 does not exhibit the
> problem on a Lenovo T510 either.
> Between a fix already in the pipeline and a modeset workaround in the
> meantime, I'm happy. Does anyone still need the bisection completed?

I, personally, wouldn't mind knowing what the cause was exactly, and what fixed it.  It could be important down the line if it ever reappears.
Comment 12 Ben Steel 2017-06-23 16:00:31 UTC
Yikes. I only waited about 28 hours before completely restoring that machine from backup and putting it back in service. Food for thought: my greatest fear is that the problem may have been illuminated by the change to GCC 7, which doesn't like compiling 4.10 kernels.

