Bug 107955

Summary:

AMDGPU driver keeps reloading on hybrid graphics system causing stuttering.

Product:

DRI

Reporter:

Ransu <gero3977>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

critical

Priority:

highest

CC:

gero3977, mike

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
Xorg.0.log (log 1)	none
xrandr information (log 1)	none
dmesg -w (log 1)	none
xorg.config (log 1)	none
perf report --header	none
Perf data	none
Report of amdgpu:*	none
Using libunwind	none
report with amdgpu DDX	none
report with amdgpu DDX with debugging	none
mpv debugging log	none
perf report with mpv debugging	none

Description Ransu 2018-09-17 04:52:52 UTC

Hi,

My system at the time of writing is a Lenovo Y40-80 with an Intel i7-5500U and an AMD Radeon R9 M275 2GB. The OS I have installed is Arch Linux.


Linux Y40-80 4.18.7-arch1-1-ARCH #1 SMP PREEMPT Sun Sep 9 11:27:58 UTC 2018 x86_64 GNU/Linux

Right now it seems AMDGPU is constantly restarting itself and this reloading is causing stuttering. I keep seeing the following two lines constantly repeating in my 'dmesg' log.

[drm] PCIE gen 2 link speeds already enabled
amdgpu 0000:05:00.0: PCIE GART of 1024M enabled (table at 0x000000F400000000).



My goal is to make use of the integrated Intel graphics for everything and use the AMD graphics when needed with PRIME. This sort of works but I don't know why I'm getting those messages repeated over and over with stuttering.
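
One rough way to quantify how often the GPU is being resumed is to count the GART message in a saved dmesg capture. The sketch below assumes the log uses dmesg's default "[seconds.microseconds]" timestamp prefix; the function names are hypothetical helpers, not part of any tool.

```python
import re

# Matches the line amdgpu prints each time the GART is (re)enabled,
# which in this report coincides with the discrete GPU being woken up.
# Assumes dmesg's default "[  123.456789] " timestamp prefix.
GART_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+amdgpu .*PCIE GART of \d+M enabled")


def resume_timestamps(dmesg_text):
    """Return the timestamps (in seconds) of every GART-enabled message."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = GART_LINE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def average_interval(stamps):
    """Average gap in seconds between consecutive resumes, or None if < 2."""
    if len(stamps) < 2:
        return None
    return (stamps[-1] - stamps[0]) / (len(stamps) - 1)
```

Feeding it the text of a saved "dmesg" capture gives a number that can be compared before and after a kernel or configuration change.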

Comment 1 Ransu 2018-09-17 04:58:28 UTC

By stuttering I mean every animation stops for a second or so. This happens every few seconds and is especially annoying during video playback.

Comment 2 Michel Dänzer 2018-09-17 09:56:41 UTC

The messages appear when the AMD GPU is woken up due to userspace using some functionality of the amdgpu kernel driver.

Does

 amdgpu.runpm=0

on the kernel command line prevent the stuttering? If so, I hope somebody can help you track down how userspace is causing the AMD GPU to wake up.

(Note that the above will keep the AMD GPU powered on all the time; it's intended to confirm that the stuttering is related to powering up the AMD GPU, not as a permanent solution)
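
To see whether the AMD GPU is actually suspending between wake-ups, the kernel's runtime PM state can be read from sysfs. This is a minimal sketch, assuming the PCI address 0000:05:00.0 from the dmesg lines in this report; the helper name and the sysfs_root parameter (there only to make the function testable) are hypothetical.

```python
from pathlib import Path

# PCI address taken from the dmesg lines in this report; adjust as needed.
DEFAULT_DEVICE = "0000:05:00.0"


def runtime_status(device=DEFAULT_DEVICE, sysfs_root="/sys"):
    """Return the runtime PM state of a PCI device, or None if unreadable.

    Typical values are "active" and "suspended".
    """
    path = Path(sysfs_root) / "bus/pci/devices" / device / "power/runtime_status"
    try:
        return path.read_text().strip()
    except OSError:
        # Device not present, or the attribute is not exposed.
        return None
```

Polling runtime_status() while the desktop is idle should show whether the device ever reaches "suspended" or keeps bouncing back to "active".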

Comment 3 Michel Dänzer 2018-09-17 09:58:41 UTC

Please attach the Xorg log file and the output of dmesg and xrandr.

Comment 4 Ransu 2018-09-17 10:48:24 UTC

Created attachment 141597 [details]
Xorg.0.log (log 1)

Comment 5 Ransu 2018-09-17 10:49:34 UTC

Created attachment 141598 [details]
xrandr information (log 1)

Comment 6 Ransu 2018-09-17 10:50:28 UTC

Created attachment 141599 [details]
dmesg -w      (log 1)

Comment 7 Ransu 2018-09-17 10:52:15 UTC

Created attachment 141600 [details]
xorg.config   (log 1)

Comment 8 Ransu 2018-09-17 10:52:30 UTC

All attachments for this comment were marked "log 1". This includes my current xorg.conf.

At the time of the attachments the kernel command line was as follows with sensitive information left out.

BOOT_IMAGE=/vmlinuz-linux root=UUID=<REDACTED> rw cryptdevice=/dev/disk/by-uuid/<REDACTED> radeon.si_support=0 amdgpu.si_support=1 memmap=10M$2245M quiet resume=UUID=<REDACTED>
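
When comparing configurations, it can help to pull the per-module options out of a command line like the one above. A small sketch, assuming options follow the usual "module.name=value" form; module_params is a hypothetical helper.

```python
def module_params(cmdline, module):
    """Collect "<module>.<name>=<value>" options from a kernel command line."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        # Skip bare flags such as "quiet" and options for other modules.
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params
```

On a running system the input would normally be the contents of /proc/cmdline.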

Comment 9 Ransu 2018-09-17 11:05:55 UTC

Adding

amdgpu.runpm=0

helps big time, but I know this means both the AMD and Intel GPUs would be running all the time, which is not ideal. As I said before, I would like to have the Intel GPU running as my main GPU and only activate the AMD GPU when I want to make use of a better GPU, preferably with PRIME.

Comment 10 Mike Lothian 2018-09-17 12:30:25 UTC

I'm not seeing stuttering, but I do see AMDGPU loading up each time I play a video on MPV or Chromium

The last time I saw this, something was using drmGetDevice rather than drmGetDevice2

Comment 11 Mike Lothian 2018-09-17 12:36:34 UTC

I did a quick grep on libraries that contain drmGetDevice and drmGetDevice2 and did a diff

-Binary file /usr/lib64/libva-drm.so.2.200.0 matches
@@ -6 +4,0 @@
-Binary file /usr/lib64/libva-wayland.so.2.200.0 matches
@@ -13,3 +10,0 @@
-Binary file /usr/lib64/xorg/modules/drivers/modesetting_drv.so matches
-Binary file /usr/lib64/xorg/modules/drivers/amdgpu_drv.so matches
-Binary file /usr/lib64/xorg/modules/libglamoregl.so matches

My guess is they're the most likely candidates for this happening
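
The grep-and-diff above can be approximated in one pass. This is only a heuristic sketch: it assumes symbol names appear as NUL-delimited strings in the binary (as they normally do in an ELF string table), and both helper names are hypothetical.

```python
from pathlib import Path


def references_symbol(path, symbol):
    """Heuristic: True if the file contains `symbol` as a NUL-delimited string."""
    needle = b"\0" + symbol.encode("ascii") + b"\0"
    return needle in Path(path).read_bytes()


def legacy_only(paths):
    """Files that mention drmGetDevice but not drmGetDevice2."""
    return [
        str(p)
        for p in paths
        if references_symbol(p, "drmGetDevice")
        and not references_symbol(p, "drmGetDevice2")
    ]
```

Matching on the NUL delimiters keeps drmGetDevices and drmGetDevice2 from being counted as hits for plain drmGetDevice, which a bare substring grep would do.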

Comment 12 Ransu 2018-09-17 14:23:07 UTC

If this is a library file issue, how should I go about fixing this? Does this need an upstream or mainline fix?

Comment 13 Mike Lothian 2018-09-17 14:53:33 UTC

I'm not sure, I'm hoping I might be pointing the devs in the right direction

I'm not sure if drmGetDeviceNameFromFd vs drmGetDeviceNameFromFd2 could cause the issue too - I think it might

I found in the xserver:

hw/xfree86/dri2/dri2.c:    if (drmGetDevice(info->fd, &dev) || dev->bustype != DRM_BUS_PCI) {
hw/xfree86/drivers/modesetting/dri2.c:    info.deviceName = drmGetDeviceNameFromFd(ms->fd);

drm/xf86drm.c:int drmGetDevices(drmDevicePtr devices[], int max_devices)

In the AMDGPU DDX:

src/amdgpu_dri2.c:      info->dri2.device_name = drmGetDeviceNameFromFd(pAMDGPUEnt->fd);

And in libva:

va/drm/va_drm_utils.c:    name = drmGetDeviceNameFromFd(fd);

I notice in the old libva1 code there was no drmGetDevice stuff and it's only in libva2 I could find the above reference

Comment 14 Mike Lothian 2018-09-17 14:55:38 UTC

Tonight when I have access to my laptop I'll try switching those to the '2' versions and see if it stops the issues, unless anyone else has any better ideas

Comment 15 Michel Dänzer 2018-09-17 15:52:26 UTC

(In reply to Mike Lothian from comment #13)
> hw/xfree86/dri2/dri2.c:    if (drmGetDevice(info->fd, &dev) || dev->bustype
> != DRM_BUS_PCI) {
> hw/xfree86/drivers/modesetting/dri2.c:    info.deviceName =
> drmGetDeviceNameFromFd(ms->fd);
> 
> [...]
> 
> src/amdgpu_dri2.c:      info->dri2.device_name =
> drmGetDeviceNameFromFd(pAMDGPUEnt->fd);

These are only called during X server startup.


> va/drm/va_drm_utils.c:    name = drmGetDeviceNameFromFd(fd);

This should only be called when a video player using VA-API runs standalone, not via X (or Wayland), and even then only once.


Try running

 sudo perf record -e rpm:rpm_resume --call-graph=dwarf

in a terminal, then do whatever is needed to reproduce the problem, then interrupt the perf command with Ctrl-C and attach the output of

 sudo perf report --header

Comment 16 Mike Lothian 2018-09-18 08:06:39 UTC

Weird, it was only showing i915 resumes and no amdgpu ones - even though dmesg clearly shows the card powering up

Comment 17 Mike Lothian 2018-09-18 09:32:26 UTC

Created attachment 141631 [details]
perf report --header

Comment 18 Mike Lothian 2018-09-18 09:36:00 UTC

Created attachment 141632 [details]
Perf data

So the --header didn't show anything, however the raw data does seem to show something amdgpu related

Comment 19 Mike Lothian 2018-09-18 09:56:39 UTC

Created attachment 141633 [details]
Report of amdgpu:*

I've repeated this but using -e amdgpu:*

Comment 20 Mike Lothian 2018-09-18 10:37:54 UTC

Created attachment 141635 [details]
Using libunwind

Comment 21 Michel Dänzer 2018-09-18 14:24:07 UTC

Comment on attachment 141635 [details]
Using libunwind

Ransu, please try to get the information from your system the same way Mike did. Looks like he's running into a different issue which only happens using the Xorg modesetting driver.

Comment 22 Mike Lothian 2018-10-05 09:15:20 UTC

Created attachment 141907 [details]
report with amdgpu DDX

this is still happening with the amdgpu DDX

Comment 23 Mike Lothian 2018-10-05 09:49:57 UTC

Created attachment 141908 [details]
report with amdgpu DDX with debugging

Comment 24 Mike Lothian 2018-10-05 10:22:58 UTC

Created attachment 141910 [details]
mpv debugging log

Comment 25 Mike Lothian 2018-10-05 10:24:20 UTC

Created attachment 141911 [details]
perf report with mpv debugging

Comment 26 Ransu 2018-11-08 15:58:13 UTC

Sorry for the lack of updates, life got in the way and then this week I made a stupid mistake on where I was sending data with 'dd'. I didn't lose any important data but it did force me to set up my laptop from scratch.

Good news! As of kernel 4.18.16 I no longer see an issue, knock on wood. I'm going to give it a week and report back but the AMDGPU driver does not seem to be reloading like mad anymore. 

Linux Y40-80 4.18.16-arch1-1-ARCH #1 SMP PREEMPT Sat Oct 20 22:06:45 UTC 2018 x86_64 GNU/Linux


I get the following two messages whenever I request the AMD GPU with "DRI_PRIME=1", and only then. When I'm using only the integrated Intel graphics of my laptop and not the discrete GPU, I no longer see the two messages below repeated over and over, nor do I see any stuttering.

> [drm] PCIE gen 2 link speeds already enabled
> amdgpu 0000:05:00.0: PCIE GART of 1024M enabled (table at 0x000000F400000000).



One other thing I should note: before I set up the system with AMDGPU by putting the kernel command line arguments "radeon.si_support=0 amdgpu.si_support=1" into my grub config, radeon alone would crash my system. I needed to switch to a different TTY before logging into XFCE to get things set up, following this page: https://wiki.archlinux.org/index.php/AMDGPU

>Nov 08 00:56:06 Y40-80 kernel: radeon 0000:05:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0x000000006a0bf82f
>Nov 08 00:56:06 Y40-80 kernel: radeon 0000:05:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0x000000003d1f62f3
>Nov 08 00:56:06 Y40-80 kernel: radeon 0000:05:00.0: failed VCE resume (-22).
>Nov 08 00:56:07 Y40-80 kernel: [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
>Nov 08 00:56:07 Y40-80 kernel: [drm:si_resume [radeon]] *ERROR* si startup failed on resume
>Nov 08 00:56:22 Y40-80 kernel: [drm:atom_op_jump [radeon]] *ERROR* atombios stuck in loop for more than 5secs aborting
>Nov 08 00:56:22 Y40-80 kernel: [drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing C078 (len 237, WS 0, PS 4) @ 0xC086
>Nov 08 00:56:22 Y40-80 kernel: [drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing B99E (len 78, WS 12, PS 8) @ 0xB9D7






So now that things appear to be working I just have a few more questions.


Does this mean that the discrete GPU should be making use of power saving features and shouldn't be draining too much power if I'm not making use of it?

and

Does my card support AMDGPU-PRO drivers? If so is there any real advantage of using the "PRO" extras over the standard open source driver?

Comment 27 Ransu 2018-11-08 16:05:34 UTC

Oh, I also added the following kernel modules to my mkinitcpio configuration, in case that helped. The first three are for the two GPUs my laptop has and the rest are for the encrypted disk.

MODULES="i915 amdgpu radeon dm_mod dm_crypt ext4 aes_x86_64 sha256 sha512"

Comment 28 Alex Deucher 2018-11-08 16:28:16 UTC

(In reply to Ransu from comment #26)
> 
> Does my card support AMDGPU-PRO drivers? If so is there any real advantage
> of using the "PRO" extras over the standard open source driver?

You only need the "PRO" driver if you need OpenGL that is certified for workstation applications or OpenCL.

Comment 29 Ransu 2018-11-16 14:01:59 UTC

(In reply to Mike Lothian from comment #22)
> Created attachment 141907 [details]
> report with amdgpu DDX
> 
> this is still happening with the amdgpu DDX

Are you still having issues?

Comment 30 Ransu 2018-11-21 05:08:20 UTC

I'm closing this as fixed. The latest code is working for me. I have now upgraded to kernel 4.19.2 in Arch Linux and things appear to continue to work as expected.


Linux Y40-80 4.19.2-arch1-1-ARCH #1 SMP PREEMPT Tue Nov 13 21:16:19 UTC 2018 x86_64 GNU/Linux
