Bug 98398 - Acer Aspire V7-582PG (Haswell, GTX 750M) fails to power off GPU with runtime PM
Summary: Acer Aspire V7-582PG (Haswell, GTX 750M) fails to power off GPU with runtime PM
Status: ASSIGNED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-22 21:23 UTC by Rick Kerkhof
Modified: 2016-11-24 00:24 UTC
CC List: 1 user

See Also:
i915 platform:
i915 features:


Attachments
dmesg output (55.24 KB, text/plain)
2016-10-22 21:23 UTC, Rick Kerkhof
no flags
acpidump output (429.11 KB, text/plain)
2016-10-22 21:39 UTC, Rick Kerkhof
no flags
lspci -tnnv output (1.17 KB, text/plain)
2016-10-22 21:43 UTC, Rick Kerkhof
no flags
New dmesg output after manually turning on power management (62.48 KB, text/plain)
2016-10-22 22:17 UTC, Rick Kerkhof
no flags
lspci -nnvvv output (12.08 KB, text/plain)
2016-10-22 22:18 UTC, Rick Kerkhof
no flags
Disable d3cold on bridge when falling back to _DSM (1.07 KB, patch)
2016-11-12 23:01 UTC, Peter Wu
no flags
attachment-21565-0.html (2.22 KB, text/html)
2016-11-15 21:30 UTC, Rick Kerkhof
no flags

Description Rick Kerkhof 2016-10-22 21:23:51 UTC
Created attachment 127483 [details]
dmesg output

Using Arch Linux and kernel 4.8.3, I am observing much higher power usage in powertop when using Nouveau with vga_switcheroo (~13 W) as opposed to NVIDIA/Bumblebee with bbswitch (~7.5 W).

I first noticed this because the battery drained much faster and the fans started spinning when they would otherwise stay idle.

Attached is my dmesg log.
Comment 1 Rick Kerkhof 2016-10-22 21:39:04 UTC
Created attachment 127484 [details]
acpidump output
Comment 2 Rick Kerkhof 2016-10-22 21:43:48 UTC
Created attachment 127485 [details]
lspci -tnnv output
Comment 3 Rick Kerkhof 2016-10-22 22:17:55 UTC
Created attachment 127486 [details]
New dmesg output after manually turning on power management

Lekensteyn asked me to run a few commands to also power down the PCIe port (bus) that the GPU is on.

If I do so and then run powertop on battery, intel_pstate complains that turbo is not available, a USB device disconnects, and a while later nouveau resumes its kernel object trees, Bluetooth reconnects, and nouveau suspends again.

The commands are:
# echo auto > /sys/bus/pci/devices/0000\:00\:1c.4/power/control 
# grep . /sys/bus/pci/devices/0000:0{0:1c.4,1:00.0}/power/{control,runtime_status}

The latter prints control and runtime_status for 0000:00:1c.4 and then for 0000:01:00.0. It returns auto, suspended, auto, suspended before running powertop, and on, active, auto, suspended after running powertop.
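A small shell sketch of the same check that labels each value, so the before/after states are easier to compare. It assumes the standard sysfs layout; `pm_state` and the `SYSFS_ROOT` override (default `/sys`) are hypothetical names introduced here, and the device addresses are the ones from this report.

```shell
# Sketch: label the runtime-PM values that the grep above prints.
# pm_state and SYSFS_ROOT are hypothetical names; SYSFS_ROOT (default /sys)
# lets the function be tried against a copied or fake sysfs tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

pm_state() {
    local dev dir
    for dev in "$@"; do
        dir="$SYSFS_ROOT/bus/pci/devices/$dev/power"
        # control: "auto" permits runtime suspend, "on" keeps the device powered.
        # runtime_status: e.g. "active" or "suspended".
        printf '%s control=%s status=%s\n' "$dev" \
            "$(cat "$dir/control")" "$(cat "$dir/runtime_status")"
    done
}
```

Source it and run `pm_state 0000:00:1c.4 0000:01:00.0` before and after starting powertop. Per the values reported above, the port line would go from `control=auto status=suspended` to `control=on status=active`, while the GPU line stays at `control=auto status=suspended`.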
Comment 4 Rick Kerkhof 2016-10-22 22:18:54 UTC
Created attachment 127487 [details]
lspci -nnvvv output
Comment 5 Rick Kerkhof 2016-10-22 23:16:29 UTC
Adding pcie_port_pm=off to my kernel command line causes the card to turn off and powertop to report ~7.5W of power usage. According to Lekensteyn, this reverts nouveau to the 4.7-and-earlier behavior of using _DSM, so I think this is a regression caused by the new method.
Comment 6 Rick Kerkhof 2016-10-22 23:24:49 UTC
Booting without pcie_port_pm=off, with nouveau blacklisted at boot, and then executing:
 echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed && modprobe nouveau

also causes powertop to report a ~7.5W value.
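The same workaround as a hedged shell sketch. `forbid_d3cold` and the `SYSFS_ROOT` override are hypothetical names; the function only writes the `d3cold_allowed` attribute used in the command above, and stops if that file is not writable.

```shell
# Sketch of the workaround above: forbid D3cold for one PCI device.
# forbid_d3cold and SYSFS_ROOT are hypothetical names; SYSFS_ROOT
# defaults to /sys and exists only so the function can be tested.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

forbid_d3cold() {
    local f="$SYSFS_ROOT/bus/pci/devices/$1/d3cold_allowed"
    if [ ! -w "$f" ]; then
        echo "cannot write $f (not root, or no such device?)" >&2
        return 1
    fi
    # 0 means the PCI core may not put this device into D3cold.
    echo 0 > "$f"
}
```

With nouveau blacklisted at boot, run as root: `forbid_d3cold 0000:01:00.0 && modprobe nouveau`. Per Peter Wu's later comment, this should give the same result as pcie_port_pm=off, i.e. the pre-4.8 behavior.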
Comment 7 Pablo Cholaky 2016-10-23 14:17:48 UTC
Just to add extra info here, this problem also happens with bbswitch https://github.com/Bumblebee-Project/bbswitch/issues/140

Guys, does anyone know whether this is really a Linux bug or intended behavior? I mean, would the changes to fix this problem be on the kernel side (PM team) or on the kernel interface side (vga_switcheroo / bbswitch)?

Regards
Comment 8 Peter Wu 2016-10-24 10:09:46 UTC
Pablo, the issue that bbswitch has is different from the one reported here: bbswitch has not been updated for 4.8 and therefore requires the pcie_port_pm=off workaround.

The reporter gave more details about this bug on IRC (search for NanoSector): https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=nouveau&date=2016-10-22

In particular, Rick reported that the issue apparently also appears with older kernels, from 4.3 up to 4.8. This is a significant and surprising result, because kernel 4.8 plus pcie_port_pm=off (or the d3cold_allowed change) should have the same result as 4.7 or before. Rick, can you re-test it with 4.7?

It also occurs to me that older kernels might not support your GPU, so be sure to keep a dmesg around.
Comment 9 Rick Kerkhof 2016-10-24 13:21:47 UTC
Sure, I'll have another test run with 4.7 this week.
Comment 10 Rick Kerkhof 2016-10-26 19:35:24 UTC
Hmm, I just installed Linux 4.7.6 and ran it without any additional kernel parameters, and I am getting results close to ~7.5W there too, so it seems to work on that version.
Comment 11 Peter Wu 2016-10-26 22:42:07 UTC
So 4.7 and before used the "DSM" method on runtime-suspend:
- \_SB.PCI0.RP05.PEGP._DSM would be invoked to enable Optimus
- \_SB.PCI0.RP05.PEGP._PS3 is then invoked which would enter D3cold
(note: this method is still used in 4.8 on older laptops or with the pcie_port_pm=off kernel option)

Since 4.8, _DSM is not called anymore by nouveau (when support from the PCI core is detected) and this sequence should instead happen:
- \_SB.PCI0.RP05.PEGP._PS3 (does nothing besides updating _STA)
- PCIe core removes power for the PCIe port since all its children are in
  D3 and are willing to transition to D3cold. It does so by invoking
  \NVP3._OFF (where \NVP3 is the power resource from \_SB.PCI0.RP05._PR3)

That is how I think it should work in theory, but on Rick's laptop running 4.8.4,
/sys/bus/pci/devices/0000:00:1c.4/firmware_node/ does not have power_resources_D0 entries (which I do have on my own laptop for 0000:01:0).
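The 4.8 condition described above (every function below the port is in D3 and willing to enter D3cold) can be checked roughly from user space. This is a sketch that mirrors the description only, not the PCI core's actual decision logic, which considers more than this. `port_can_d3cold` and `SYSFS_ROOT` are hypothetical names, runtime-suspended is used as a stand-in for "in D3", and child directories are assumed to be named like `0000:01:00.0` (PCI domain 0000).

```shell
# Rough check: are all functions below a PCIe port runtime-suspended and
# allowed to enter D3cold? Hypothetical helper, not the kernel's algorithm.
# SYSFS_ROOT (default /sys) exists only so the function can be tested.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

port_can_d3cold() {
    local port="$SYSFS_ROOT/bus/pci/devices/$1" child status allowed found=0
    # Child functions appear as subdirectories named like 0000:01:00.0.
    for child in "$port"/0000:*; do
        [ -d "$child" ] || continue   # glob matched nothing
        found=1
        status=$(cat "$child/power/runtime_status")
        allowed=$(cat "$child/d3cold_allowed")
        echo "$(basename "$child") status=$status d3cold_allowed=$allowed"
        # Runtime-suspended is used here as a stand-in for "in D3".
        if [ "$status" != suspended ] || [ "$allowed" != 1 ]; then
            return 1
        fi
    done
    # A port with no child functions is reported as "no".
    [ "$found" -eq 1 ]
}
```

`port_can_d3cold 0000:00:1c.4` prints one line per function below the port and returns 0 only if all of them qualify.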

The SSDT1 of Rick's Acer laptop shows this structure:

    If (\_OSI ("Windows 2013"))
    {
        Scope (\_SB.PCI0.RP05)
        {
        //...
            Name (_PR0, Package (0x01)  // _PR0: Power Resources for D0
            {
                NVP3
            })
            Name (_PR2, Package (0x01)  // _PR2: Power Resources for D2
            {
                NVP2
            })
            Name (_PR3, Package (0x01)  // _PR3: Power Resources for D3hot
            {
                NVP3
            })
            // ...
            Method (_PS0, 0, NotSerialized)  // _PS0: Power State 0
            {
            }

            Method (_PS3, 0, NotSerialized)  // _PS3: Power State 3
            {
            }
        }

        Name (MSD3, Zero)
        PowerResource (NVP3, 0x00, 0x0000)
        {
            Name (_STA, One)  // _STA: Status
            // ...

            Method (_ON, 0, NotSerialized)  // _ON_: Power On
            {
                // ...
            }

            Method (_OFF, 0, NotSerialized)  // _OFF: Power Off
            {
                // ...
            }
        }

The dmesg does show "ACPI: Power Resource [NVP3] (on)", so I guess that the methods are found. It is a mystery to me why the "power_resources_Dx" files are not created, possibly breaking PM.
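To see which `power_resources_D*` groups (if any) the ACPI companion of a PCI device exposes, a sketch like the following can be used. It assumes the group names `power_resources_D0` through `power_resources_D3hot` that the ACPI core creates for devices listing power resources in `_PRx`; `acpi_power_resources` and `SYSFS_ROOT` are hypothetical names.

```shell
# List the power_resources_D* groups under a PCI device's firmware_node.
# acpi_power_resources and SYSFS_ROOT are hypothetical names; SYSFS_ROOT
# (default /sys) exists only so the function can be tested.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

acpi_power_resources() {
    local fw="$SYSFS_ROOT/bus/pci/devices/$1/firmware_node" group n=0
    if [ ! -d "$fw" ]; then
        echo "$1: no firmware_node (no ACPI companion?)" >&2
        return 1
    fi
    for group in "$fw"/power_resources_D*; do
        [ -d "$group" ] || continue   # glob matched nothing
        basename "$group"
        n=$((n + 1))
    done
    # Non-zero exit status when no group exists.
    [ "$n" -gt 0 ]
}
```

Per the comment above, `acpi_power_resources 0000:00:1c.4` on Rick's laptop would find no groups and return non-zero, even though `\_SB.PCI0.RP05` declares `_PR0`, `_PR2` and `_PR3`.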
Comment 12 Peter Wu 2016-10-28 23:23:18 UTC
At the moment it looks like an ACPI core bug which manifested in nouveau. See
https://lists.freedesktop.org/archives/nouveau/2016-October/026395.html and the replies. I'll post a workaround patch soon.
Comment 13 Peter Wu 2016-11-12 23:01:20 UTC
Created attachment 127942 [details] [review]
Disable d3cold on bridge when falling back to _DSM

The workaround patch has been merged in v4.9-rc3-34-gb0a6af8 (and backported to 4.8.7 via v4.8.6-109-g7290da4) but apparently it broke (system?) suspend/resume according to the reporter.

Before the workaround patch:
 - _PR3 method is found, so nouveau assumes that PCI core takes care of D3cold.
 - Due to an ACPICA bug, PCI core fails to power off the device via runtime PM:
   https://bugs.acpica.org/show_bug.cgi?id=1333

After the workaround patch I guess that this happens:

 - _PR3 method is found, but unusable. Nouveau falls back to _DSM.
 - Due to the above ACPICA bug, the power resources are not owned by any device. I
   guess that Linux then decides to power off the "unnecessary" power resource
   after system resume. (I saw something like this in a dmesg for a similar SSDT.)
 - At this point I would guess that nouveau then follows the old DSM method, but
   then I am confused because pcie_port_pm=off (or pre-4.8 kernels) supposedly
   have the same issue with this power resource.
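The before/after reasoning above can be summarized as a toy decision function. This is a hypothetical model for illustration only, not nouveau's actual code; `suspend_path` and its yes/no arguments are invented here, and the pcie_port_pm=off case is not modeled.

```shell
# Toy model (hypothetical, not nouveau's code) of which runtime-suspend
# path is taken, before and after the workaround patch.
#   $1 patched: "yes" if the workaround patch is applied
#   $2 has_pr3: "yes" if the port has _PR3
#   $3 usable:  "yes" if its power resources were actually registered
suspend_path() {
    local patched="$1" has_pr3="$2" usable="$3"
    if [ "$has_pr3" != yes ]; then
        echo dsm              # no _PR3: old _DSM + _PS3 sequence
    elif [ "$patched" = yes ] && [ "$usable" != yes ]; then
        echo dsm              # patched: unusable _PR3 falls back to _DSM
    else
        echo pci-core-d3cold  # rely on the PCI core to cut power to the port
    fi
}
```

For Rick's laptop, `suspend_path no yes no` gives `pci-core-d3cold`, which per the description above never actually powers the GPU off because of the ACPICA bug; with the patch, `suspend_path yes yes no` gives `dsm`.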

If pcie_port_pm=off helps, then the attached patch should also work (no pcie_port_pm=off needed). Can you give it a try on top of v4.8.7?
Comment 14 Peter Wu 2016-11-12 23:07:20 UTC
Rick, were you actually able to suspend the system with kernel 4.7 and nouveau?
Bug 98582 has a similar acpidump and claims that v4.7 also failed to suspend (actually, resume).
Comment 15 Rick Kerkhof 2016-11-15 21:30:05 UTC
Created attachment 127999 [details]
attachment-21565-0.html

On Sun, 13 Nov 2016 at 00:07, <bugzilla-daemon@freedesktop.org> wrote:

> Rick, were you actually able to suspend the system with kernel 4.7 and nouveau?
> Bug 98582 has a similar acpidump and claims that v4.7 also failed to suspend
> (actually, resume).

Using pcie_port_pm=off on kernel 4.8 does not make resuming work; it still
hangs on resume with a black screen and no backlight (and sometimes a
pointer with a black background).
Comment 16 Peter Wu 2016-11-24 00:24:28 UTC
Rick reported that system suspend did not work before the patch either, so there is no regression in that sense.

The ACPICA developers were faster than expected. Can you test these three patches?
https://bugs.acpica.org/show_bug.cgi?id=1333#c45

