Bug 94725

Summary: Nouveau driver fails to poweron GPU on GM204 after dynamic poweroff
Product: xorg Reporter: Rashed Abdel-Tawab <rashed>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: efremmc2, fabrice, gnurou, perry_yuan, peter
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=156341
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg from boot | none
dmesg while crashing | none
Tentative fix | none
dmesg using tentative fix | none
attachment-28993-0.html | none
_msg_kernel_pci-23.txt | none
new dmesg during hang | none
nouveau run on GM170 with default runpm | none
Schenker XMD A506 acpidump | none

Description Rashed Abdel-Tawab 2016-03-27 16:28:08 UTC
Created attachment 122587 [details]
dmesg from boot

Booting Linux 4.6-rc1 and Mesa 11.2/11.3 fails to load the nouveau driver on a GTX 970M (6GB) on an MSI GS60 Ghost Pro 4K (i7-6700HQ). It spews out wonderful messages like
[    2.146398] nouveau 0000:01:00.0: priv: HUB0: 10ecc0 ffffffff (1940822c)
[    2.154362] vga_switcheroo: enabled
[    2.154567] [TTM] Zone  kernel: Available graphics memory: 8170764 kiB
[    2.154568] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    2.154569] [TTM] Initializing pool allocator
[    2.154572] [TTM] Initializing DMA pool allocator
[    2.154577] nouveau 0000:01:00.0: DRM: VRAM: 6144 MiB
[    2.154578] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
[    2.154580] nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
[    2.154582] nouveau 0000:01:00.0: DRM: DCB version 4.1
[    2.154583] nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

Attached is a dmesg from boot. The system just falls back to the i915 driver, so the machine is usable, but whenever I run lspci or lshw, or try to log out of the X session, it hangs when it switches back to the nVidia GPU (the laptop has an LED indicator showing which GPU is in use).
Comment 1 Pierre Moreau 2016-03-27 17:23:37 UTC
Nouveau is successfully loaded on your laptop, but it seems to fail when it tries to wake up the NVIDIA GPU (if you look at the dmesg you linked, around 11sec, the NVIDIA GPU goes to sleep). You could try booting with `nouveau.runpm=0` on the kernel command line, and see if you still get the issue.
Do you have any dmesg from when it hangs?

IIRC, Alexandre Courbot sent a patch some time ago to fix an issue where the driver would try to reload the signed firmware upon resume and fail, but I would have guessed it is included in 4.6-rc1.
Comment 2 Ilia Mirkin 2016-03-27 17:35:46 UTC
In addition to the runpm=0 thing, please ensure that you have the appropriate firmware installed for this GPU - it should be in linux-firmware.git by now (nvidia/*). I don't see a message about nouveaufb, which could be due to how you configured your kernel, but it could also be because you don't have the firmware, and the user helper is kicking in and waiting 60 seconds for it to fail out, so nouveau's not fully done loading by the time the runpm stuff kicks in. Just a theory.
Comment 3 Rashed Abdel-Tawab 2016-03-27 17:50:44 UTC
(In reply to Ilia Mirkin from comment #2)
> In addition to the runpm=0 thing, please ensure that you have the
> appropriate firmware installed for this GPU - it should be in
> linux-firmware.git by now (nvidia/*). I don't see a message about nouveaufb,
> which could be due to how you configured your kernel, but it could also be
> because you don't have the firmware, and the user helper is kicking in and
> waiting 60 seconds for it to fail out, so nouveau's not fully done loading
> by the time the runpm stuff kicks in. Just a theory.

I have the gm20x firmware from the linux-firmware repo installed.

(In reply to Pierre Moreau from comment #1)
> Nouveau is successfully loaded on your laptop, but it seems to fail when it
> tries to wake up the NVIDIA GPU (if you look at the dmesg you linked, around
> 11sec, the NVIDIA GPU goes to sleep). You could try booting with
> `nouveau.runpm=0` on the kernel command line, and see if you still get the
> issue.
> Do you have any dmesg from when it hangs?

I'll try that in a bit and also try to get a dmesg from when it hangs (not at my computer ATM). I see "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!" when it hangs during a logout/shutdown, but that's not particularly helpful.
Comment 4 Rashed Abdel-Tawab 2016-03-28 02:53:31 UTC
Created attachment 122591 [details]
dmesg while crashing

Here is the dmesg from when it crashes. I ran lshw, which seems to have triggered the nVidia card to power back up, and that caused the crash. With runpm=0 the nVidia card is never powered off, so it doesn't crash.
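For anyone repeating this test, a minimal sketch (not part of the original report) of how one might confirm that `nouveau.runpm=0` actually reached the kernel. It assumes the command line is read from /proc/cmdline and that shell-style quoting is close enough for parameters such as acpi_osi="!Windows 2013"; the helper names `parse_cmdline` and `runpm_disabled` are hypothetical.

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value} (value is None for bare flags).

    Assumption: shell-style quoting rules are a good enough approximation.
    If a parameter is repeated, the last occurrence wins in this sketch.
    """
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def runpm_disabled(cmdline: str) -> bool:
    """True if nouveau runtime power management was switched off at boot."""
    return parse_cmdline(cmdline).get("nouveau.runpm") == "0"


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print("runpm disabled:", runpm_disabled(f.read()))
```

This only checks the boot parameter; it does not say anything about whether the GPU is currently powered.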
Comment 5 Efrem McCrimon 2016-03-30 22:22:55 UTC
I have the same problem with a GM206, GTX 960. I have to power-cycle the computer twice.
Comment 6 Alexandre Courbot 2016-04-01 03:13:31 UTC
Maybe these messages are pointing to the root of the problem:

[   51.608479] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   51.683924] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   51.700020] nouveau 0000:01:00.0: Refused to change power state, currently in D3

If the device is still in D3 when we resume it, then accessing registers would understandably result in a freeze. Devinit comes early enough in the resume chain to make this plausible.

FWIW I can successfully suspend/resume (echo mem >/sys/power/state) a GTX 960, but runtime PM works slightly differently.

I would like to enable runtime PM on my desktop GTX960 to repro this, but for some reason I am failing - despite loading nouveau with "modeset=2 runpm=1", I cannot see runtime PM kicking in and /sys/class/drm/card0/power/runtime_status says "unsupported". What am I doing wrong?
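A minimal sketch (not from the original comment) of reading and interpreting the runtime PM status mentioned above. The sysfs path is the one quoted in this comment; the status strings are the ones the kernel documents for the `power/runtime_status` attribute, and the helper names are hypothetical.

```python
from pathlib import Path

# Status strings documented for the power/runtime_status sysfs attribute.
RUNTIME_STATUS_MEANING = {
    "active": "device is powered and in use",
    "suspended": "device is runtime-suspended",
    "suspending": "device is in the middle of suspending",
    "resuming": "device is in the middle of resuming",
    "error": "a runtime PM callback failed",
    "unsupported": "runtime PM is not enabled for this device",
}


def describe_runtime_status(raw: str) -> str:
    """Turn the raw file content into 'status: explanation'."""
    status = raw.strip()
    meaning = RUNTIME_STATUS_MEANING.get(status, "unrecognised value")
    return f"{status}: {meaning}"


def read_runtime_status(path="/sys/class/drm/card0/power/runtime_status"):
    """Return a description of the status, or None if the file does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    return describe_runtime_status(p.read_text())


if __name__ == "__main__":
    print(read_runtime_status())
```

An "unsupported" result means runtime PM was never enabled for that device node, so no amount of idling will make it suspend.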
Comment 7 Dave Airlie 2016-04-01 03:22:04 UTC
Can you try booting with acpi_osi="!Windows 2013" on the kernel command line?
Comment 8 Alexandre Courbot 2016-04-01 06:47:11 UTC
Created attachment 122653 [details]
Tentative fix

The attached patch *might* help with this issue, but I have no way to test it.

Rashed, Efrem, can one of you give it a try and tell us if it helps?
Comment 9 Alexandre Courbot 2016-04-01 06:52:25 UTC
Dave, I tried adding the option you suggested, but sadly it did not allow me to enable runtime PM. /sys/class/drm/card0/power/runtime_status still says "unsupported" despite nouveau.ko being loaded with "modeset=2 runpm=1".
Comment 10 Rashed Abdel-Tawab 2016-04-01 07:52:47 UTC
Created attachment 122654 [details]
dmesg using tentative fix

Alexandre, using the tentative fix you uploaded it switches GPUs properly now. Here is the dmesg from that since there are still some errors related to power state in it. Also, for some reason lshw is returning this for the nVidia GPU now:

*-generic
                description: Unassigned class
                product: Illegal Vendor ID
                vendor: Illegal Vendor ID
                physical id: 0
                bus info: pci@0000:01:00.0
                version: ff
                width: 32 bits
                clock: 66MHz
                capabilities: bus_master vga_palette cap_list rom
                configuration: driver=nouveau latency=255 maxlatency=255 mingnt=255
                resources: irq:129 memory:dc000000-dcffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128) memory:dd000000-dd07ffff

I don't know if that's related to this at all, but before, if I set runpm=0 and ran lshw, it would return the proper description (running it without runpm=0 would cause the system to hang).
Comment 11 Alexandre Courbot 2016-04-01 08:02:25 UTC
Thanks Rashed. This looks better but something seems to be going wrong with PCI. I'm pretty clueless about PCI/ACPI, so let's see if someone else has something to suggest...
Comment 12 Efrem McCrimon 2016-04-01 13:04:57 UTC
Created attachment 122662 [details]
attachment-28993-0.html

I will not be able to test until later this afternoon. I have a GTX 960 as PCI 01:00.0 and GTX 730 as PCI 02:00.0.
I will send over dmesg/journalctl -k.

Regards
Efrem
Comment 13 Karol Herbst 2016-04-01 15:17:34 UTC
(In reply to Rashed Abdel-Tawab from comment #10)
> Created attachment 122654 [details]
> dmesg using tentative fix
> 
> Alexandre, using the tentative fix you uploaded it switches GPUs properly
> now. Here is the dmesg from that since there are still some errors related
> to power state in it. Also, for some reason lshw is returning this for the
> nVidia GPU now:
> 
> *-generic
>                 description: Unassigned class
>                 product: Illegal Vendor ID
>                 vendor: Illegal Vendor ID
>                 physical id: 0
>                 bus info: pci@0000:01:00.0
>                 version: ff
>                 width: 32 bits
>                 clock: 66MHz
>                 capabilities: bus_master vga_palette cap_list rom
>                 configuration: driver=nouveau latency=255 maxlatency=255
> mingnt=255
>                 resources: irq:129 memory:dc000000-dcffffff
> memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128)
> memory:dd000000-dd07ffff
> 
> I don't know if that's related to this at all but before, if I set runpm=0
> and run lshw, it would return the proper description (running it without
> runpm would cause the system to hang)

I have the same issue with bbswitch (and maybe vga_switcheroo too). Basically this means that in D3cold we can't talk to the GPU, and the information isn't cached or something like that.
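To illustrate why lshw prints "Illegal Vendor ID" and "version: ff" here: reads from a PCI device that does not respond come back as all ones, so the vendor ID decodes as 0xFFFF. The sketch below (not from the thread) decodes the first four bytes of config space; `decode_pci_ids` is a hypothetical helper, and the device ID 0x1234 in the usage line is a placeholder, not the real ID of any GPU discussed here.

```python
import struct


def decode_pci_ids(config_header: bytes):
    """Decode (vendor_id, device_id, responding) from the start of PCI config space.

    Both IDs are 16-bit little-endian values at offsets 0 and 2.
    A vendor ID of 0xFFFF means the read returned all ones, i.e. nothing answered.
    """
    if len(config_header) < 4:
        raise ValueError("need at least 4 bytes of config space")
    vendor_id, device_id = struct.unpack_from("<HH", config_header, 0)
    responding = vendor_id != 0xFFFF
    return vendor_id, device_id, responding


if __name__ == "__main__":
    # 0x10de is NVIDIA's PCI vendor ID; 0x1234 is a placeholder device ID.
    print(decode_pci_ids(bytes([0xDE, 0x10, 0x34, 0x12])))
    # What a powered-down device looks like from the bus.
    print(decode_pci_ids(b"\xff\xff\xff\xff"))
```

The example deliberately uses literal bytes. Reading /sys/bus/pci/devices/0000:01:00.0/config on an affected machine would itself wake the GPU, which is the operation that hangs in this bug.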
Comment 14 Efrem McCrimon 2016-04-05 01:25:24 UTC
Created attachment 122708 [details]
_msg_kernel_pci-23.txt

This is what I captured from dmesg.

PCI 02:00.0 is a 730 GTX w/ 1 GB GDDR5
PCI 01:00.0 is a 960 GTX w/ 4 GB GDDR5


Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: bios: version 84.06.26.00.2c
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: gr: using external firmware
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: Direct firmware load for nvidia/gm206/fecs_inst.bin failed with error -2
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: gr: failed to load fecs_inst
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: disp: dcb 15 type 8 unknown
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: fb: 4096 MiB GDDR5
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: TMDS table version 2.0
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 00: 01000f02 00020030
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 01: 02000f00 00000000
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 02: 04011f82 00020030
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 03: 02022f62 00020010
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 05: 02833f76 04400020
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 06: 02033f72 00020020
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB outp 15: 01df5ff8 00000000
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 00: 00001030
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 01: 01000131
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 02: 00010261
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 03: 00020346
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: DCB conn 05: 00000570
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: unknown connector type 70
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: failed to create encoder 1/8/0: -19
Apr 02 17:08:56 localhost kernel: nouveau 0000:01:00.0: DRM: Unknown-1 has no encoders, removing
Apr 02 17:08:57 localhost kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
Apr 02 17:08:57 localhost kernel: nouveau 0000:01:00.0: DRM: allocated 1920x1080 fb: 0x60000, bo ffff88089ac02800
Apr 02 17:08:57 localhost kernel: nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
Apr 02 17:08:57 localhost kernel: nouveau 0000:02:00.0: enabling device (0000 -> 0003)
Apr 02 17:08:57 localhost kernel: nouveau 0000:02:00.0: NVIDIA GK208B (b06070b1)
Apr 02 17:08:57 localhost kernel: nouveau 0000:02:00.0: bios: version 80.28.78.00.01
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: priv: HUB0: 086014 ffffffff (1f70820c)
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: fb: 1024 MiB GDDR5
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: VRAM: 1024 MiB
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: GART: 1048576 MiB
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: TMDS table version 2.0
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB version 4.0
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB outp 00: 01000f02 00020030
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB outp 01: 02011f62 00020010
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB outp 02: 02022f10 00000000
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB conn 00: 00001031
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB conn 01: 00002161
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: DCB conn 02: 00000200
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: DRM: MM: using COPY for buffer copies
Apr 02 17:08:58 localhost kernel: nouveau 0000:02:00.0: No connectors reported connected with modes
Apr 02 17:08:59 localhost kernel: nouveau 0000:02:00.0: DRM: allocated 1024x768 fb: 0x60000, bo ffff88089a47c400
Apr 02 17:08:59 localhost kernel: nouveau 0000:02:00.0: fb1: nouveaufb frame buffer device

Apr 02 17:09:01 localhost.localdomain kernel: mei_me 0000:00:16.0: enabling device (0000 -> 0002)
Apr 02 17:09:01 localhost.localdomain kernel: snd_hda_intel 0000:01:00.1: Disabling MSI
Apr 02 17:09:01 localhost.localdomain kernel: snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
Apr 02 17:09:01 localhost.localdomain kernel: snd_hda_intel 0000:02:00.1: Disabling MSI
Apr 02 17:09:01 localhost.localdomain kernel: snd_hda_intel 0000:02:00.1: Handle vga_switcheroo audio client
Apr 02 17:09:02 localhost.localdomain kernel: snd_hda_intel 0000:00:1f.3: failed to add i915 component master (-19)
Apr 02 17:09:02 localhost.localdomain kernel: e1000e 0000:00:1f.6: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Apr 02 17:09:02 localhost.localdomain kernel: e1000e 0000:00:1f.6 eth1: registered PHC clock
Apr 02 17:09:02 localhost.localdomain kernel: e1000e 0000:00:1f.6 eth1: (PCI Express:2.5GT/s:Width x1) 00:1f:bc:0f:37:76
Apr 02 17:09:02 localhost.localdomain kernel: e1000e 0000:00:1f.6 eth1: Intel(R) PRO/1000 Network Connection
Apr 02 17:09:02 localhost.localdomain kernel: e1000e 0000:00:1f.6 eth1: MAC: 12, PHY: 12, PBA No: FFFFFF-0FF
Apr 02 17:09:03 localhost.localdomain kernel: e1000e 0000:00:1f.6 enp0s31f6: renamed from eth1
Apr 02 17:09:11 localhost.localdomain kernel: ahci 0000:00:17.0: port does not support device sleep
Apr 02 17:09:29 localhost.localdomain kernel: e1000e 0000:00:1f.6 enp0s31f6: 10/100 speed: disabling TSO



Comment 15 Alexandre Courbot 2016-04-06 01:55:39 UTC
So according to Karol's comment it seems like the issue might be fixed. Rashed, can you confirm that the GPU is operational after runtime resume with the patch I posted?
Comment 16 Rashed Abdel-Tawab 2016-04-06 04:11:11 UTC
(In reply to Alexandre Courbot from comment #15)
> So according to Karol's comment it seems like the issue might be fixed.
> Rashed, can you confirm that the GPU is operational after runtime resume
> with the patch I posted?

I can confirm the driver no longer hangs on runtime resume, yes. I don't know how to offload to the GPU, so I can't say whether it's operational.
Comment 17 Rashed Abdel-Tawab 2016-04-06 04:52:14 UTC
Created attachment 122747 [details]
new dmesg during hang

I decided to keep testing this in case we missed something, and running lshw twice in a row causes it to hang. I'm not sure what's up with it so I've attached the dmesg. It looks pretty similar to before, but that doesn't make sense since Alexandre patched the original problem.
Comment 18 Karol Herbst 2016-04-06 12:08:28 UTC
uhh the card seems pretty much messed up after resume, because several things just fail.
Comment 19 Tobias Jakobi 2016-06-05 12:37:17 UTC
I think I'm seeing the same issue here on a Schenker XMG A506 notebook. The NV GPU is the dedicated one. AFAIK all display hardware is connected to the Intel iGPU.

Kernel is vanilla 4.6.1. Actually this is the first kernel (and the first time) that I try to use the dedicated GPU.

lspci:
01:00.0 VGA compatible controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)

Without runpm=0 even just calling DRI_PRIME=1 glxinfo leaves the system unresponsive shortly afterwards. SysRq still works though.

Going to attach the part of the log in a second.
Comment 20 Tobias Jakobi 2016-06-05 12:38:36 UTC
Created attachment 124326 [details]
nouveau run on GM170 with default runpm
Comment 21 Tobias Jakobi 2016-06-05 15:03:59 UTC
Created attachment 124332 [details]
Schenker XMD A506 acpidump
Comment 22 Peter Wu 2016-06-05 15:30:53 UTC
Tobias, your issue might be different and is caused by an issue with a Windows 10-specific workaround in the firmware for your Clevo-based hardware (notice the AML_INFINITE_LOOP in the PGON method). Workaround for you: add acpi_osi="!Windows 2015" to your cmdline.

@Rashed, @Efrem Please attach the uncompressed acpidump output.
Comment 23 maimu 2016-07-01 12:33:14 UTC
http://forums.fedoraforum.org/showthread.php?t=310422
Does this issue affect mine as well?
Comment 24 Peter Wu 2016-07-16 17:14:38 UTC
@Efrem Your issue occurs with completely different (desktop) hardware. Please file a new bug.

(In reply to Rashed Abdel-Tawab from comment #17)
> Created attachment 122747 [details]
> new dmesg during hang
> 
> I decided to keep testing this in case we missed something, and running lshw
> twice in a row causes it to hang. I'm not sure what's up with it so I've
> attached the dmesg. It looks pretty similar to before, but that doesn't make
> sense since Alexandre patched the original problem.

Chris Wilson reported a similar trace for an Acer Aspire VN7-791G.
https://lists.freedesktop.org/archives/nouveau/2016-July/025602.html

I found ACPI information in the BIOS firmware for your laptop from
https://us.msi.com/Laptop/GS60-Ghost-Pro-4K-6th-Gen-GTX-970M.html
It could be the same PCI problem that Tobias and I are experiencing, except that our laptops hang in ACPI while Rashed's firmware does not check the PCIe link status in the ACPI firmware. Still digging...
Comment 25 Perry Yuan 2016-09-20 08:48:11 UTC
Hi guys,
This bug looks similar to my graphics panic, so I am adding the related logs here.
When I boot the system, it hangs in a nouveau driver panic and repeatedly prints nouveau panic info; I have to force the system off.
If I add nouveau.runpm=0, the system boots normally without a panic.

My graphics: NVIDIA GM107

The panic info is related to PCI operations.
---------------------------------------------------
[   21.303467] Bluetooth: RFCOMM socket layer initialized
[   21.303471] Bluetooth: RFCOMM ver 1.11
[   22.095603] nouveau 0000:01:00.0: DRM: evicting buffers...
[   22.095605] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[   22.095622] nouveau 0000:01:00.0: DRM: suspending client object trees...
[   22.098407] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
[   22.220120] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[   23.669052] Non-volatile memory driver v1.3
[   24.083662] pci_raw_set_power_state: 48 callbacks suppressed
[   24.083665] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   24.159560] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   24.179571] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   24.179575] nouveau 0000:01:00.0: DRM: resuming kernel object tree...
[   24.179617] nouveau 0000:01:00.0: pci: failed to adjust cap speed
[   24.179618] nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed
[   24.317867] wlp2s0: authenticate with f0:b4:29:87:9e:e4
---------------------------------------------------
and the frequent panic is:

Jul 21 18:38:50 localhost kernel: [<ffffffffa03c463b>] ? nv04_timer_read+0x2b/0x70 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa03c41af>] nvkm_timer_read+0xf/0x20 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa03bc598>] nvkm_pmu_init+0x58/0x480 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa0374e8e>] nvkm_subdev_init+0xee/0x230 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa03c84df>] nvkm_device_init+0x18f/0x280 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa03cc138>] nvkm_udevice_init+0x48/0x60 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa03736c0>] nvkm_object_init+0x50/0x1c0 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa0373705>] nvkm_object_init+0x95/0x1c0 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa037062e>] nvkm_client_init+0xe/0x10 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa04125fe>] nvkm_client_resume+0xe/0x10 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa036f7b4>] nvif_client_resume+0x14/0x20 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa040fbed>] nouveau_do_resume+0x4d/0x130 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffffa041000c>] nouveau_pmops_runtime_resume+0x7c/0x120 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffff81350c8b>] pci_pm_runtime_resume+0x7b/0xc0
Jul 21 18:38:50 localhost kernel: [<ffffffff81350c10>] ? pci_restore_standard_config+0x40/0x40
Jul 21 18:38:50 localhost kernel: [<ffffffff8142ddb6>] __rpm_callback+0x36/0xc0
Jul 21 18:38:50 localhost kernel: [<ffffffff8142de64>] rpm_callback+0x24/0x80
Jul 21 18:38:50 localhost kernel: [<ffffffff8142ee39>] rpm_resume+0x4e9/0x670
Jul 21 18:38:50 localhost kernel: [<ffffffff812a7ffb>] ? cred_has_capability+0x6b/0x120
Jul 21 18:38:50 localhost kernel: [<ffffffff8142f00f>] __pm_runtime_resume+0x4f/0x80
Jul 21 18:38:50 localhost kernel: [<ffffffffa04107bb>] nouveau_drm_open+0x3b/0x1b0 [nouveau]
Jul 21 18:38:50 localhost kernel: [<ffffffff812a80de>] ? selinux_capable+0x2e/0x40
Jul 21 18:38:50 localhost kernel: [<ffffffff812a1cc8>] ? security_capable+0x18/0x20
Jul 21 18:38:50 localhost kernel: [<ffffffffa008add6>] drm_open+0x1f6/0x470 [drm]
Jul 21 18:38:50 localhost kernel: [<ffffffffa0091759>] drm_stub_open+0xa9/0x120 [drm]
Jul 21 18:38:50 localhost kernel: [<ffffffff811fc4b1>] chrdev_open+0xa1/0x1e0
Jul 21 18:38:50 localhost kernel: [<ffffffff811f5567>] do_dentry_open+0x1a7/0x2e0
Jul 21 18:38:50 localhost kernel: [<ffffffff812a21ac>] ? security_inode_permission+0x1c/0x30
Jul 21 18:38:50 localhost kernel: [<ffffffff811fc410>] ? cdev_put+0x30/0x30
Jul 21 18:38:50 localhost kernel: [<ffffffff811f573d>] vfs_open+0x5d/0xd0
Jul 21 18:38:50 localhost kernel: [<ffffffff81203058>] ? may_open+0x68/0x110
Jul 21 18:38:50 localhost kernel: [<ffffffff81206b3d>] do_last+0x1ed/0x12a0
Jul 21 18:38:50 localhost kernel: [<ffffffff811d9c96>] ? kmem_cache_alloc_trace+0x1d6/0x200
Jul 21 18:38:50 localhost kernel: [<ffffffff81207cb2>] path_openat+0xc2/0x490
Jul 21 18:38:50 localhost kernel: [<ffffffff8120947b>] do_filp_open+0x4b/0xb0
Jul 21 18:38:50 localhost kernel: [<ffffffff812160a7>] ? __alloc_fd+0xa7/0x130
Jul 21 18:38:50 localhost kernel: [<ffffffff811f6a93>] do_sys_open+0xf3/0x1f0
Jul 21 18:38:50 localhost kernel: [<ffffffff811f6bae>] SyS_open+0x1e/0x20
Jul 21 18:38:50 localhost kernel: [<ffffffff816956c9>] system_call_fastpath+0x16/0x1b
Comment 26 Orsiris de Jong 2016-12-02 09:52:40 UTC
I too have an Optimus laptop with a GM107 / Intel graphics card, and I have not been able to use the nouveau driver since I bought the laptop (using kernels 4.3.5 up to 4.8.10 on Fedora 23-25 x64).

I have searched for multiple solutions so far, and always ended up blacklisting nouveau in order to be able to use the laptop.
I could not even launch lspci without rendering my system unresponsive (the process then takes 100% CPU and I get CPU soft lockups).

I've tried nouveau.runpm=0 and well, I can at least boot without having nouveau totally disabled :)

When not using runpm=0, I get the following when launching lspci or DRI_PRIME=1 glxgears:

[ 1470.121123] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 1470.193738] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 1470.205762] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 1470.205766] nouveau 0000:01:00.0: DRM: resuming kernel object tree...
[ 1470.205792] nouveau 0000:01:00.0: pci: failed to adjust cap speed
[ 1470.205793] nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed

So I guess the nouveau driver cannot put my NVIDIA card to sleep or wake it up, for whatever reason.
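Since the same "Refused to change power state" lines show up in several dmesg excerpts in this thread, here is a minimal sketch (not from the original comment) that picks them out of plain dmesg output. It assumes the bracketed-timestamp format shown above rather than the journalctl prefix, and `find_refused_transitions` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [ 1470.121123] nouveau 0000:01:00.0: Refused to change power state, currently in D3
REFUSED_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"Refused to change power state, currently in (?P<state>D\S+)$"
)


def find_refused_transitions(lines):
    """Return (timestamp, pci_address, state) for every refused power transition."""
    hits = []
    for line in lines:
        m = REFUSED_RE.match(line.strip())
        if m:
            hits.append((float(m["ts"]), m["bdf"], m["state"]))
    return hits


if __name__ == "__main__":
    sample = [
        "[ 1470.121123] nouveau 0000:01:00.0: Refused to change power state, currently in D3",
        "[ 1470.205766] nouveau 0000:01:00.0: DRM: resuming kernel object tree...",
    ]
    for hit in find_refused_transitions(sample):
        print(hit)
```

Matching on the PCI address as well as the message makes it easier to tell which device failed to leave D3 on machines with more than one GPU.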

I am actually trying to rely on the discrete graphics in order to reduce power consumption by default.

I noticed that I can't use vga_switcheroo to choose my primary graphics card. I also noticed the following error messages on boot:
[    1.795356] [drm] Initialized drm 1.1.0 20060810
[    1.881362] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    1.881428] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    1.881552] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    1.881753] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
[    1.881753] nouveau: detected PR support, will not use DSM

So far, when I switch graphic cards, I get the following in dmesg:

[ 1174.921195] nouveau 0000:01:00.0: DRM: resuming kernel object tree...
[ 1174.921209] nouveau 0000:01:00.0: DRM: resuming client object trees...
[ 1174.921226] nouveau 0000:01:00.0: fifo: BIND_ERROR 01 [BIND_NOT_UNBOUND]
[ 1317.637168] vga_switcheroo: client 0 refused switch

I guess that the nouveau driver does not allow the NVIDIA card to be disabled when runpm=0 is passed as an argument?

Is there anything I could try, please?
Comment 27 Peter Wu 2016-12-02 10:54:00 UTC
Orsiris, what laptop model do you have? It sounds like you are affected by
https://bugzilla.kernel.org/show_bug.cgi?id=156341

See also
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
Comment 28 Pablo Cholaky 2016-12-02 15:07:04 UTC
Can you try adding pcie_port_pm=off to your boot flags?
Comment 29 Orsiris de Jong 2016-12-04 09:09:19 UTC
@Peter I have an MSI GE62 6QD (Skylake i7-6700HQ), which is on the list in both links you gave me.
Using acpi_osi=! acpi_osi="Windows 2009", I can launch lspci, but the system freezes a minute or so later with

[   92.249957] nouveau 0000:01:00.0: DRM: resuming kernel object tree...
[   92.337917] nouveau 0000:01:00.0: DRM: resuming client object trees...
[   98.017700] nouveau 0000:01:00.0: DRM: evicting buffers...
[   98.017704] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[   98.017726] nouveau 0000:01:00.0: DRM: suspending client object trees...
[   98.020446] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
[   98.617684] nouveau 0000:01:00.0: priv: HUB0: 10ecc0 ffffffff (1e40822c)

followed by a cpu lock.

Is there any debug info I can provide ?

@Pablo I have tried pcie_port_pm=off alone or together with the acpi_osi boot flags, but it doesn't seem to make any difference.
Comment 30 rmokkink 2017-01-02 10:50:58 UTC
Hi,

I have the same problems with my new MSI GP62 6QF laptop.
It has a GeForce GTX 960M video card. I installed Fedora 25 on it. The only way to log in to the system (GNOME) was to disable nouveau power management with the GRUB option nouveau.runpm=0.

I did manage to log in to the system with the option:
acpi_osi=! "acpi_os=Windows 2013"

In /sys/kernel/debug/vgaswitcheroo/switch I see the following:
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0

But it looks like it is not working properly: I see the red light on the power button, meaning the other card is active as well.
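For reference, a minimal sketch (not from the original comment) of how the two vga_switcheroo lines above break down into fields. The field order (index, client type, active marker, power state, PCI address) is inferred from the output quoted here; `SwitcherooClient` and `parse_switcheroo_line` are hypothetical names.

```python
from typing import NamedTuple


class SwitcherooClient(NamedTuple):
    index: int
    kind: str          # "IGD" (integrated) or "DIS" (discrete)
    active: bool       # "+" marks the client currently driving the outputs
    power: str         # e.g. "Pwr", "Off", "DynPwr", "DynOff"
    pci_address: str


def parse_switcheroo_line(line: str) -> SwitcherooClient:
    """Parse one line of /sys/kernel/debug/vgaswitcheroo/switch.

    The PCI address itself contains colons, so only the first four are split on.
    """
    index, kind, active, power, pci_address = line.rstrip("\n").split(":", 4)
    return SwitcherooClient(int(index), kind, active == "+", power, pci_address)


if __name__ == "__main__":
    for raw in ("0:IGD:+:Pwr:0000:00:02.0", "1:DIS: :DynOff:0000:01:00.0"):
        print(parse_switcheroo_line(raw))
```

In the output above the discrete GPU is reported as inactive and "DynOff", so the kernel believes it is powered down, which is at odds with the LED described in this comment.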

I also tried the options:

i915.modeset=-1 nouveau.runpm=-1. Occasionally I can log in, but then the system freezes completely.

Is this a kernel bug or a nouveau bug?

Thanks.
Comment 31 Martin Peres 2019-12-04 09:11:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/257.
