Bug 98502

Summary: Delay when starting firefox, thunderbird or chromium and dmesg spam
Product: Mesa Reporter: Mauro Santos <registo.mailling>
Component: Mesa coreAssignee: mesa-dev
Status: RESOLVED FIXED QA Contact: mesa-dev
Severity: normal    
Priority: medium CC: andyrtr, emil.l.velikov, eugene.shalygin+bugzilla.FDO
Version: 13.0   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Bisect log

Description Mauro Santos 2016-10-30 14:24:37 UTC
With mesa 13.0.0rc2 I'm seeing a 2 to 3 second delay when starting firefox, thunderbird or chromium. This happens every time the programs are closed (all instances) and run again. Glxgears (without prime offloading) also seems to be affected as well as lspci, this wasn't the case before with mesa 12.0.3.

When this happens the following messages are logged in dmesg:

[drm] probing gen 2 caps for device 8086:9d18 = 9724043/e
[drm] enabling PCIE gen 3 link speeds, disable with radeon.pcie_gen2=0
[drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
radeon 0000:03:00.0: WB enabled
radeon 0000:03:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff88042bbafc00
radeon 0000:03:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff88042bbafc04
radeon 0000:03:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff88042bbafc08
radeon 0000:03:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff88042bbafc0c
radeon 0000:03:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff88042bbafc10
radeon 0000:03:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90002235a18
radeon 0000:03:00.0: failed VCE resume (-110).
[drm] ring test on 0 succeeded in 1 usecs
[drm] ring test on 1 succeeded in 1 usecs
[drm] ring test on 2 succeeded in 1 usecs
[drm] ring test on 3 succeeded in 4 usecs
[drm] ring test on 4 succeeded in 4 usecs
[drm] ring test on 5 succeeded in 2 usecs
[drm] UVD initialized successfully.
[drm] ib test on ring 0 succeeded in 0 usecs
[drm] ib test on ring 1 succeeded in 0 usecs
[drm] ib test on ring 2 succeeded in 0 usecs
[drm] ib test on ring 3 succeeded in 0 usecs
[drm] ib test on ring 4 succeeded in 0 usecs
[drm] ib test on ring 5 succeeded

I'm using a laptop with an Intel iGPU with and an AMD dGPU, it seems that with mesa 13.0.0rc2 all the gpus in the system (iGPU+dGPU) are probed instead of only the gpu that is going to be used (iGPU).

Given that the dGPU is usually automatically turned off for power saving it takes 2 to 3 seconds to be re-initialized and this causes the delay on startup for the programs I've mentioned before (I suspect many more are affected).

This might not be an actual bug (new intended behavior?) but it is a change in behavior and it is annoying as opening any affected application will incur a 2 to 3 second delay on startup, even if all files are already cached in ram and the startup would otherwise be virtually instant.

OS: Arch Linux x86-64, kernel 4.8.4, mesa 13.0.0rc2, xorg 1.18.4 using the modesetting driver for all cards.

Machine: Lenovo E560, iGPU HD Graphics 520 (rev 07), dGPU [AMD/ATI] Mars [Radeon HD 8670A/8670M/8750M] (rev 81)
Comment 1 Alex Deucher 2016-10-30 16:13:23 UTC
If this is a regression can you bisect?
Comment 2 Mauro Santos 2016-10-30 17:29:54 UTC
I'll try to bisect and see if I can find the offending commit, I'll post back when I have a result.

Are there any tests you'd like me to do or other information you'd like me to post?
Comment 3 Mauro Santos 2016-10-30 23:41:56 UTC
After bisecting (hope I've done it right), the first bad commit is be239326aa4f9317d42ee07f0f51179c8b3d5b22

I've tried to revert this commit and test again but I haven't been able to do so.

Bisect log is attached.
Comment 4 Mauro Santos 2016-10-30 23:42:15 UTC
Created attachment 127635 [details]
Bisect log
Comment 5 Emil Velikov 2016-10-31 10:51:22 UTC
That's because the drmDevice API retrieves the device revision id.

IIRC that was something explicitly requested by Jammy and the only way to get the info is to parse ./config which wakes up the device.

AFAICT if using libudev one also needs to parse ./config since the kernel does not expose a revision_id file/info like the {subsystem_,}{device,vendor} ones.

I think we'd really want to change that [in the long term]. There was a patch for it, but seems like it never got sent/merged.
Comment 6 Emil Velikov 2016-10-31 10:59:55 UTC
Short term ideas:
libdrm
 - cheat, don't parse ./config. Read the revision_id file and set to zero if missing. Default for all or toggle {config,revision_id} by envvar/API ?
 - roll revision_id-less API. Duplication of (already tad nasty) code :-(

mesa
 - go back to manual sysfs parsing in mesa
I'd like to avoid that if possible - makes BSD/other integration cumbersome, code duplication (?)


Fwiw, I'm leaning for the "cheat" since no open-source user-space (seems to) depend on the revision_id value, and aim for getting a kernel patch for revision_id upstreamed.

Opinions/other ideas ?
Comment 7 Mauro Santos 2016-11-01 16:14:12 UTC
As a user, any band aid that will mask the problem until the proper long term fix is in place will be very much welcomed.

As to how it is implemented, I am not familiar with the mesa and/or libdrm code, if you think the cheat option will be the best stopgap measure I'll trust your judgement.
Comment 8 Emil Velikov 2016-11-01 16:30:17 UTC
The kernel patch is out and should fit the mailing list archives in due time. 

Michel since Jammy is no longer around can you check (forward to AMDGPU-PRO team)  about the implications of the "cheat" route, please ?
Comment 9 Michel Dänzer 2016-11-02 01:20:15 UTC
I'm not sure what exactly you're asking me to do. Can you just ask on the amd-gfx mailing list directly?
Comment 10 Emil Velikov 2017-02-20 20:20:59 UTC
Guys, the kernel (v4.10) has been updated to provide the revision field [1], at the same time libdrm (v2.4.75) has API which does not fetch it [2] and the mesa patches [3] to us the new API has been updated.

Sadly applying [3] triggers a libtool bug^Wfeature where linking fails on the Intel CI. Once we get a workaround for that I'll apply [3].


[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=702ed3be1b1bf4dea05954168321741c0910c645
[2] https://cgit.freedesktop.org/mesa/drm/commit/?id=11687bf4180f7e21045ed9c4730533c40fe01ea5
[3] https://patchwork.freedesktop.org/series/19714/
Comment 11 Eugene Shalygin 2017-02-27 22:22:41 UTC
I have the same problem: Qt5 wakes up dGPU on launch [1]. However, I believe that this is a kernel problem: the dGPU wakes up when a program tries to read from /sys/bus/pci/devices/<bus id>/config. This was not happening with 4.7 and when I use Arch's current LTS kernel (4.4) it also works fine [2].

I can't believe that this is intentional, because a) waking up dGPU on every program launch does not make any sense and b) when the dGPU is disable via acpi_call, any program hangs reading the same file. What puzzles me most is the following: I have two almost identical laptops (for this bug they can be treated as identical, I believe) and the second one, running Gentoo, is free from this problem with the same versions of kernel. Moreover, with kernel configured using config from the second one, the first one still shows the problematic behaviour. 

[1] https://bugs.archlinux.org/task/52452
[2] https://bugzilla.kernel.org/show_bug.cgi?id=194721
Comment 12 Tobias Droste 2017-02-27 22:50:49 UTC
(In reply to Eugene Shalygin from comment #11)
> I have the same problem: Qt5 wakes up dGPU on launch [1]. However, I believe
> that this is a kernel problem: the dGPU wakes up when a program tries to
> read from /sys/bus/pci/devices/<bus id>/config. This was not happening with
> 4.7 and when I use Arch's current LTS kernel (4.4) it also works fine [2].

As soon as all the patches landed (from comment #10) there's no need to check /sys/bus/pci/devices/<bus id>/config anymore, because there's a new API to get that info that doesn't need to wake up the device and the problem goes away. But right now the mesa patches are not applied as far as I can tell.

> 
> I can't believe that this is intentional [...]

It's not :-)
Comment 13 Mauro Santos 2017-02-27 23:31:52 UTC
(In reply to Tobias Droste from comment #12)
> (In reply to Eugene Shalygin from comment #11)
> > I have the same problem: Qt5 wakes up dGPU on launch [1]. However, I believe
> > that this is a kernel problem: the dGPU wakes up when a program tries to
> > read from /sys/bus/pci/devices/<bus id>/config. This was not happening with
> > 4.7 and when I use Arch's current LTS kernel (4.4) it also works fine [2].
> 
> As soon as all the patches landed (from comment #10) there's no need to
> check /sys/bus/pci/devices/<bus id>/config anymore, because there's a new
> API to get that info that doesn't need to wake up the device and the problem
> goes away. But right now the mesa patches are not applied as far as I can
> tell.
> 

Aren't most patches in mesa and libdrm by now? At least the versions provided by Arch already seem to have the patches (or at least some of them).

The only missing piece seems to be the kernel patch which should land once kernel 4.10 leaves testing.

However 4.9 does not have the patch and as it an LTS release it will stick around for a while, it would be nice if the sysfs patch was also included in 4.9.

> > 
> > I can't believe that this is intentional [...]
> 
> It's not :-)
Comment 14 Emil Velikov 2017-02-27 23:48:29 UTC
(In reply to Tobias Droste from comment #12)
> (In reply to Eugene Shalygin from comment #11)
> > I have the same problem: Qt5 wakes up dGPU on launch [1]. However, I believe
> > that this is a kernel problem: the dGPU wakes up when a program tries to
> > read from /sys/bus/pci/devices/<bus id>/config. This was not happening with
> > 4.7 and when I use Arch's current LTS kernel (4.4) it also works fine [2].
> 
> As soon as all the patches landed (from comment #10) there's no need to
> check /sys/bus/pci/devices/<bus id>/config anymore, because there's a new
> API to get that info that doesn't need to wake up the device and the problem
> goes away. But right now the mesa patches are not applied as far as I can
> tell.
> 
Yes and no - yes the patches in comment #10 will help from graphics POV, but no they won't help with the 'generic case' that Eugene is talking about. With "generic case" being anyone reading the ./config file will wake up the device.

Eugene if behaviour has changed across kernel versions [i.e. you're confident that with kernel X the dGPU was powered off and reading ./config did not wake it up] feel free to bisect and report it.
Comment 15 Emil Velikov 2017-02-27 23:55:03 UTC
(In reply to Mauro Santos from comment #13)

> Aren't most patches in mesa and libdrm by now? At least the versions
> provided by Arch already seem to have the patches (or at least some of them).
> 
You know that everything is open-source right ;-)

Arch has also gone full r****d mode and is patching mesa for EGL GLVND, but none of the above patches are included. Yet the kernel and libdrm are vanilla.
Comment 16 Eugene Shalygin 2017-02-28 01:29:36 UTC
> Eugene if behaviour has changed across kernel versions...

I'm confident that from 2014 and up to 4.10.0 on my Gentoo machine this was never the case. I don't quite understand what do I need to bisect, because the same kernel sources (4.9.11) compiled using the same config give different results on two identical laptops. May it be that a difference in environment (at compile-time or run-time) causes this problem?
Comment 17 Eugene Shalygin 2017-02-28 01:56:54 UTC
To clarify: lspci was never waking up the dGPU. I assume that it was always reading the device config file.
Comment 18 Emil Velikov 2017-02-28 09:40:00 UTC
(In reply to Eugene Shalygin from comment #17)
> To clarify: lspci was never waking up the dGPU. I assume that it was always
> reading the device config file.

The kernel behaviour change(?) does sound surprising. I suggest checking that the device was powered off in both cases - leaning that it might not have been.

That said, it sounds like the root of your issue is different than the one covered here - pardon for dragging you :-\
Comment 19 Mauro Santos 2017-02-28 12:39:41 UTC
(In reply to Emil Velikov from comment #18)
> (In reply to Eugene Shalygin from comment #17)
> > To clarify: lspci was never waking up the dGPU. I assume that it was always
> > reading the device config file.
> 
> The kernel behaviour change(?) does sound surprising. I suggest checking
> that the device was powered off in both cases - leaning that it might not
> have been.
 
I do have to say that I see the same behavior, although at the time what made me notice the change is behavior was a mesa update. The only change for the following tests was booting with a different kernel.

With 4.4.52 lspci says:  
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Mars [Radeon HD 8670A/8670M/8750M] [1002:6600] (rev ff)

With 4.9.11 lspci says:
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Mars [Radeon HD 8670A/8670M/8750M] [1002:6600] (rev 81)

Notice that the revision is different so I would bet the kernel was not behaving properly before.

Also /sys/kernel/debug/dri/1/radeon_pm_info says the following right after using lspci:
With kernel 4.4.52:
PX asic powered off

With kernel 4.9.11:
uvd    vclk: 0 dclk: 0
power level 0    sclk: 30000 mclk: 15000 vddc: 900 vddci: 0 pcie gen: 2

> That said, it sounds like the root of your issue is different than the one
> covered here - pardon for dragging you :-\
Comment 20 Alex Deucher 2017-02-28 16:11:30 UTC
(In reply to Mauro Santos from comment #19)
> With 4.4.52 lspci says:  
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI]
> Mars [Radeon HD 8670A/8670M/8750M] [1002:6600] (rev ff)
> 
> With 4.9.11 lspci says:
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI]
> Mars [Radeon HD 8670A/8670M/8750M] [1002:6600] (rev 81)
> 
> Notice that the revision is different so I would bet the kernel was not
> behaving properly before.

Previously lspci would just read back from pci config space for stuff that was not cached which resulted in reading back all ones if the device was powered off, hence the rev ff.  The pci revision id is now cached so it's no longer read back from pci config space which is why is shows the proper value (rev 81).
Comment 21 Mauro Santos 2017-02-28 17:20:47 UTC
(In reply to Alex Deucher from comment #20)
> Previously lspci would just read back from pci config space for stuff that
> was not cached which resulted in reading back all ones if the device was
> powered off, hence the rev ff.  The pci revision id is now cached so it's no
> longer read back from pci config space which is why is shows the proper
> value (rev 81).

You lost me here, I assume that by caching you mean the sysfs patch that adds the revision field. I have tested with 4.4.52 and 4.9.11 and those do not have that patch.

The difference between both is that with 4.4.52 the device stays powered off when using lspci but with 4.9.11 the device is automatically powered on. This is using all the same software versions, the only difference is booting into different kernel versions.
Comment 22 Eugene Shalygin 2017-02-28 20:43:20 UTC
(In reply to Mauro Santos from comment #21)
> The difference between both is that with 4.4.52 the device stays powered off
> when using lspci but with 4.9.11 the device is automatically powered on.
> This is using all the same software versions, the only difference is booting
> into different kernel versions.

I don't quite understand the situation too. I observe two opposite behaviours with the same kernel version (4.9): lspci wakes up dGPU on ArchLinux or not (Gentoo). Which one is bugged?
Comment 23 Emil Velikov 2017-03-01 13:16:57 UTC
(In reply to Eugene Shalygin from comment #22)

> I don't quite understand the situation too. I observe two opposite
> behaviours with the same kernel version (4.9): lspci wakes up dGPU on
> ArchLinux or not (Gentoo). Which one is bugged?

As I said earlier - first, you want to ensure that the dGPU is powered off in _both_ cases (Arch vs Gentoo). Assuming you're OK with building your own kernel a couple of printfs is all you need.

If the state is not the same - check you can the conditions under which the device should be powered-off - powertop toggles, something using your dGPU, etc.

If they are - check with the lspci/other version and thus code used on both systems.

In either case all this sounds unrelated to this bug, so I'd suggest keeping it separate.
Comment 24 Emil Velikov 2017-03-15 11:51:55 UTC
Back to the original issue:

kernel, libdrm and mesa changes are in v4.10, v2.4.75 and [to be announced] V17.1.0 respectively.
Comment 25 Eugene Shalygin 2017-04-07 19:29:32 UTC
Just found that dGPU wakes up when I connect an Android phone via USB. And what is worse, the dGPU does not power off after that until reboot.
Comment 26 Eugene Shalygin 2017-04-07 19:52:57 UTC
(In reply to Eugene Shalygin from comment #25)
> Just found that dGPU wakes up when I connect an Android phone via USB.

No, any USB device makes that. kernel 4.10.8.
Comment 27 Emil Velikov 2017-04-08 16:04:44 UTC
(In reply to Eugene Shalygin from comment #26)
> (In reply to Eugene Shalygin from comment #25)
> > Just found that dGPU wakes up when I connect an Android phone via USB.
> 
> No, any USB device makes that. kernel 4.10.8.

While the side effects are similar, the issue seems unrelated. There are some suggestions in comment 23, that you might want to explode.

In either case, please keep it a separate bug report. Thank you!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.