Bug 101665

Summary: lspci blocks forever with a GP107M
Product: xorg
Component: Driver/nouveau
Version: unspecified
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Reporter: Kenneth Graunke <kenneth>
Assignee: Nouveau Project <nouveau>
QA Contact: Xorg Project Team <xorg-team>
CC: andrey+freedesktop, eurbah, jan.public, kai.heng.feng, peter, rhyskidd
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=100228
https://bugs.freedesktop.org/show_bug.cgi?id=104621
Attachments (description; flags):
- lspci for GP107M [GeForce GTX 1050 Ti Mobile]; none
- dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with hybrid graphics; none
- dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with discrete only graphics; none

Description Kenneth Graunke 2017-06-30 17:13:52 UTC
Hello,

I have a Dell XPS 15 9560 with a GTX 1050 Mobile (GP107M) and an Intel Kaby Lake CPU. With kernel 4.11.3 on the Arch installer image, whenever I run 'lspci', it blocks indefinitely and never prints any output.

After running lspci, dmesg contains:

[   54.819264] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   54.879968] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   54.879973] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   54.879974] nouveau 0000:01:00.0: DRM: resuming object tree...

Eventually the scheduler gets cranky about the hung process and starts spewing
[  245.385522] INFO: task lspci:576 blocked for more than 120 seconds.
and a backtrace every 2 minutes.

Blacklisting nouveau makes lspci work.
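
For anyone comparing their own logs against the symptoms above, here is a minimal sketch that pulls the two message types quoted in this description out of dmesg text. The regular expressions are derived only from the lines shown here, and the names `scan_dmesg`, `REFUSED` and `HUNG` are hypothetical; other kernels may word these messages differently.

```python
import re

# Patterns modelled on the log lines quoted in this report (an assumption:
# other kernel versions may format these messages differently).
REFUSED = re.compile(
    r"nouveau (\S+): Refused to change power state, currently in (D\d\w*)"
)
HUNG = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def scan_dmesg(text):
    """Return (refused, hung) lists extracted from dmesg text.

    refused: (pci_address, power_state) tuples
    hung:    (task_name, pid, seconds) tuples
    """
    refused = [m.groups() for m in REFUSED.finditer(text)]
    hung = [(name, int(pid), int(secs)) for name, pid, secs in HUNG.findall(text)]
    return refused, hung


# Two lines copied from the description above.
SAMPLE = """\
[   54.819264] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[  245.385522] INFO: task lspci:576 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    refused, hung = scan_dmesg(SAMPLE)
    print("refused power state changes:", refused)
    print("hung tasks:", hung)
```

To check a real log, the output of `dmesg` can be saved to a file and its contents passed to `scan_dmesg`.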
Comment 1 Ilia Mirkin 2017-06-30 18:08:43 UTC
I believe including a full dmesg (from boot) would be helpful to show what may be going wrong.

Note that running with nouveau.runpm=0 will prevent the suspend from happening. However that will, of course, cause the GPU to remain on. [I believe it will remain on without nouveau loading as well, but with this new PCIe PM stuff, I'm not sure anymore.]
Comment 2 Peter Wu 2017-07-04 12:22:14 UTC
With the "new PCIe PM stuff", if nouveau is not loaded and something else enabled automatic runtime PM (via powertop, via TLP or manually by writing "auto" to /sys/bus/pci/devices/.../power/control) for the Nvidia PCI devices, then indeed the problematic ACPI methods could be triggered.
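
To see whether automatic runtime PM is enabled for a device, the `power/control` and `power/runtime_status` attributes can be read from sysfs. The sketch below is only an illustration: the function name `runtime_pm_state` and its `root` parameter are hypothetical, and the address `0000:01:00.0` is the one from the dmesg lines in the description.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")


def runtime_pm_state(address, root=SYSFS_PCI):
    """Read power/control and power/runtime_status for one PCI device.

    Attributes that are missing or unreadable are reported as None.
    """
    power = Path(root) / address / "power"
    state = {}
    for name in ("control", "runtime_status"):
        try:
            state[name] = (power / name).read_text().strip()
        except OSError:
            state[name] = None
    return state


if __name__ == "__main__":
    # Address taken from the dmesg output in this report; adjust as needed.
    print(runtime_pm_state("0000:01:00.0"))
```

A `control` value of "auto" means runtime PM is allowed for the device, while "on" keeps it powered.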

Kenneth, can you upload your acpidump?
sudo pacman -S acpidump && sudo acpidump > acpidump.txt

Most likely you are affected by
https://bugzilla.kernel.org/show_bug.cgi?id=156341
Comment 3 Karol Herbst 2017-07-10 20:44:39 UTC
This is because lspci reads the device's config file, which then triggers a full GPU wake-up, a silly thing to do in the first place.

What we need is something like this in the kernel: https://github.com/karolherbst/linux/commit/cb918e4c926990dfcfce92e1ecd905e0896de605 and then make use of those in userspace, so that we don't need to read config every time anymore.
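
As a rough userspace illustration of avoiding the config read (this is not the kernel change in the linked commit), the sketch below reads the `vendor`, `device` and `class` attribute files from sysfs. To my understanding these are served from values the kernel cached at enumeration, so reading them should not require a config-space access, but that is an assumption worth verifying on an affected machine. The function name `pci_identity` and the `root` parameter are hypothetical.

```python
from pathlib import Path


def pci_identity(address, root="/sys/bus/pci/devices"):
    """Return vendor, device and class IDs for a PCI device as integers.

    Reads the per-device sysfs attribute files (e.g. "0x10de") instead of
    the binary `config` file.
    """
    dev = Path(root) / address
    ident = {}
    for name in ("vendor", "device", "class"):
        # The files hold hexadecimal text with a 0x prefix.
        ident[name] = int((dev / name).read_text().strip(), 16)
    return ident
```

For example, `pci_identity("0000:01:00.0")` would read the attributes of the device address that appears in the dmesg output from the description.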
Comment 4 Etienne URBAH 2017-11-13 18:39:53 UTC
I am trying to use 'nouveau' with GP107M [GeForce GTX 1050 Ti Mobile].

With Linux kernel 4.13.0-16 from Ubuntu 17.10 (Artful), 'lspci' systematically and immediately makes one CPU freeze.

Therefore, I am testing Linux kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline

With Linux kernel 4.14.0-rc7, this issue does NOT show up :

$ lspci -nn -v -s 1:0
01:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] [10de:1c8c] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] GP107M [GeForce GTX 1050 Ti Mobile] [1462:11c8]
	Flags: bus master, fast devsel, latency 0, IRQ 134
	Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at df000000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nouveau
	Kernel modules: nvidiafb, nouveau

With Linux kernel 4.14.0-rc8, 'lspci' systematically and immediately makes the whole computer freeze.

With Linux kernel 4.14.0 (released yesterday), 'lspci' systematically fails to answer, and makes the whole computer freeze after some time.

So, there is probably a regression.

I have also reported this issue at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729736
Comment 5 Etienne URBAH 2017-11-21 23:21:53 UTC
With the following Linux kernels, 'lspci' systematically fails to answer and immediately makes the whole machine freeze:
- 4.13.0-17 from Ubuntu 17.10 (Artful)
- 4.14.1 from http://kernel.ubuntu.com/~kernel-ppa/mainline
Comment 6 Etienne URBAH 2017-11-27 20:41:09 UTC
With Linux kernel 4.15.0-041500rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.
Comment 7 Etienne URBAH 2017-12-04 21:22:18 UTC
With Linux kernel 4.15.0-041500rc2 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.
Comment 8 Etienne URBAH 2017-12-11 23:45:25 UTC
With Linux kernel 4.15.0-041500rc3 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

In addition, I found the following messages in 'kern.log':
nouveau 0000:01:00.0: DRM: BIT table 'A' not found
nouveau 0000:01:00.0: DRM: BIT table 'L' not found
nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
Comment 9 Pierre Moreau 2017-12-12 09:05:12 UTC
@Étienne Could you please provide the information that was asked in comment #1 and comment #2 of this bug report? Adding `nouveau.runpm=0` to the kernel command line should avoid the freeze but will prevent the NVIDIA card from being suspended.
Looking at the bug report mentioned in comment #2, could you try booting with `acpi_osi=! acpi_osi="Windows 2009"` and/or `acpi_rev_override=5` on the kernel command line (without `nouveau.runpm=0`)?
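
When testing these options, it is easy to lose track of which ones a given boot actually used. The sketch below is a hypothetical helper (`workaround_params` is not an existing tool) that extracts the three parameters discussed in this comment from a kernel command line string such as the contents of `/proc/cmdline`. It uses `shlex` so that a quoted value like "Windows 2009" stays in one piece.

```python
import shlex

# The kernel parameters discussed in this bug report.
WORKAROUND_KEYS = ("nouveau.runpm", "acpi_osi", "acpi_rev_override")


def workaround_params(cmdline):
    """Map each workaround parameter to the list of values found for it.

    A parameter may appear more than once (acpi_osi does in the suggested
    command line), so every value is kept, in order.
    """
    found = {key: [] for key in WORKAROUND_KEYS}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if sep and key in found:
            found[key].append(value)
    return found


if __name__ == "__main__":
    # Example string built from the options suggested in this comment.
    example = 'quiet acpi_osi=! acpi_osi="Windows 2009"'
    print(workaround_params(example))
```

On a running system the string can come from reading `/proc/cmdline`.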
Comment 10 Etienne URBAH 2017-12-12 15:29:49 UTC
Created attachment 136105 [details]
lspci for GP107M [GeForce GTX 1050 Ti Mobile]

Many thanks to Pierre Moreau for suggesting these kernel command line options:

-  Adding just 'nouveau.runpm=0' prevents the whole machine from freezing, but 'nouveau' FAILS to drive an external display at 3840 x 2160 resolution and 60Hz through DisplayPort.

-  Adding just 'acpi_rev_override=5' does NOT prevent the whole machine from freezing.

-  Adding just 'acpi_osi=! acpi_osi="Windows 2009"' allows 'lspci' to succeed, and 'nouveau' successfully drives an external display at 3840 x 2160 resolution and 60Hz through DisplayPort.
Comment 11 Etienne URBAH 2018-03-17 02:39:40 UTC
With Linux kernel 4.16.0-041600rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.
Comment 12 Etienne URBAH 2018-04-18 18:08:09 UTC
With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, graphical login fails, and the machine is frozen.
Comment 13 Etienne URBAH 2018-04-18 18:41:09 UTC
With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, I confirm that systematically :

-  Inside a Linux console, 'lspci' fails to answer, and makes the whole machine immediately freeze.

-  Graphical login fails, and makes the whole machine immediately freeze.
Comment 14 Loris Z. 2018-05-14 11:34:00 UTC
I'm experiencing the same issue on an XPS 9560 with Ubuntu 18.04 (same symptoms, same dmesg output).
Any new information required?
Comment 15 Etienne URBAH 2018-06-19 12:30:46 UTC
With Linux kernel 4.18.0-041800rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline :

I tried to open a Linux console with Ctrl Alt F2, but did NOT succeed.

Systematically, graphical login fails, and makes the whole machine immediately freeze.
Comment 16 Andrey Melentyev 2019-01-09 09:53:04 UTC
Created attachment 143031 [details]
dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with hybrid graphics

I may be blind but I haven't seen anyone attaching full dmesg output, sorry if I missed it.

The laptop in question is a Thinkpad X1 Extreme with NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1). It has a BIOS setting to switch between "Hybrid" and "Discrete only" graphics.

This dmesg output is from a boot with "Hybrid" graphics, where running 'lspci' hangs and causes the system fans to spin up. The X server hangs on start too (using the modesetting DDX).
Comment 17 Andrey Melentyev 2019-01-09 09:54:58 UTC
Created attachment 143032 [details]
dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with discrete only graphics

Not sure if that helps with the diagnostics, but on the same Thinkpad X1 Extreme laptop with "Discrete only" BIOS setting, lspci works fine, X starts and works, but there's a timeout logged by nouveau.
