Bug 104481

Summary: GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7 platforms while playing video
Product: Mesa
Reporter: Luis Mendes <luis.p.mendes>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact: Default DRI bug account <dri-devel>
Severity: major    
Priority: medium    
Version: git   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform:
i915 features:
Attachments: dmesg and iomem data from lockup obtained with glretrace
Processes listing and gdb backtraces for all threads - glretrace lockup
dmesg and iomem data from lockup obtained with kodi on amd64
Processes listing and gdb backtraces for all threads - kodi amd64 lockup
Kernel hung task backtrace from GPU hang caused by glretrace replay

Description Luis Mendes 2018-01-03 17:04:37 UTC
Created attachment 136527 [details]
dmesg and iomem data from lockup obtained with glretrace

I am getting GPU lockups while playing video in Kodi. The same happens with other applications that play video, while OpenGL seems to be stable.
The system seems to be more sensitive to VP9-encoded videos. The freeze happens on both amd64 and armv7l platforms.
I am also able to reproduce GPU hangs on amd64 by replaying, with glretrace, an apitrace captured from kodi on the arm platform.

The arm dmesg and traces show a clear GPU lockup. The amd64 dmesg is less clear, but the user experience is the same: the graphical system freezes completely, while the machine keeps working over ssh or other remote connections.

Please find amd64 logs in attachments, including iomem, dmesg and gdb traces.

On both platforms I am using Ubuntu 17.10 with the Mate desktop and the lightdm session manager, with libdrm-2.4.89, mesa-17.4 at commit "radv: Implement binning on GFX9." - 6a36bfc64d2096aa338958c4605f5fc6372c07b8, and kernel https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.16 at commit "drm/amdgpu: Correct the IB size of bo update mapping." - 104bd2ca1124dfd9aa904d5f5a96253ef2b580f6.

Please note that the system was more stable a few weeks ago with drm-next-4.16 based on kernel 4.15-rc2 and a previous mesa version; I don't remember the actual commits. Although it was more stable on both arm and amd64, both systems still crashed in a similar way. The problem just became more evident with these new versions.

There are two distinct crash behaviours on amd64: the ones that I obtained while playing a video with kodi on amd64 and those that I obtained on amd64 by replaying an apitrace from the arm platform while playing a VP9 video with kodi.

The first kind of crash is documented in the logs kodi-processes_and_backtraces.txt and kodi-amdgpu_lockup_dmesg_and_iomem.txt.
The second kind of crash is documented in the logs glretrace-processes_and_backtraces.txt and glretrace-amdgpu_lockup_dmesg_and_iomem.txt.

For some strange reason the amd64 platform is complaining about the polaris11 firmware files, but they are in /lib/firmware and they were taken by cloning https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git. I am using the same firmware files and the same graphics card on armv7l, and it doesn't complain about the firmware there.
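
One way to cross-check the firmware complaint is to pull the file names the kernel reports as failed out of dmesg and test whether those files exist under the firmware directory. The following is a minimal sketch, not something taken from the attached logs: it assumes the kernel's usual "Direct firmware load for <name> failed with error <n>" message wording, and the function names `failed_firmware` and `firmware_status` are hypothetical.

```python
import re
from pathlib import Path

# Assumed message wording of the kernel firmware loader; it may differ
# between kernel versions.
FW_FAIL = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware(dmesg_text):
    """Return (firmware name, error code) pairs reported as failed, in log order."""
    return [(m.group(1), int(m.group(2))) for m in FW_FAIL.finditer(dmesg_text)]


def firmware_status(dmesg_text, root="/lib/firmware"):
    """Map each failed firmware name to whether that file exists under root."""
    return {
        name: (Path(root) / name).is_file()
        for name, _ in failed_firmware(dmesg_text)
    }
```

Passing the saved dmesg text to `firmware_status` gives one entry per reported file. A file that is reported as failed but is present on disk would suggest the complaint is not about a missing file; the sketch cannot say what the actual cause is.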

I can also provide the apitrace trace file, but it takes around 1GB of data.
Comment 1 Luis Mendes 2018-01-03 17:07:12 UTC
Created attachment 136528 [details]
Processes listing and gdb backtraces for all threads - glretrace lockup

This is the process listing and the gdb backtraces for all glretrace threads after the GPU hang. The hang was caused by replaying, with glretrace, the apitrace captured on the arm platform from kodi playing a VP9 encoded video.
Comment 2 Luis Mendes 2018-01-03 17:08:56 UTC
Created attachment 136529 [details]
dmesg and iomem data from lockup obtained with kodi on amd64

This attachment contains the dmesg and iomem information retrieved after the GPU lockup that occurred when playing a VP9 encoded video with kodi directly on the amd64 platform.
Comment 3 Luis Mendes 2018-01-03 17:10:35 UTC
Created attachment 136530 [details]
Processes listing and gdb backtraces for all threads - kodi amd64 lockup

This attachment contains the process listing and gdb backtraces for all kodi threads, retrieved after the GPU lockup that occurred when playing a VP9 encoded video with kodi directly on the amd64 platform.
Comment 4 Luis Mendes 2018-01-03 17:50:14 UTC
Created attachment 136532 [details]
Kernel hung task backtrace from GPU hang caused by glretrace replay

This attachment contains the first kernel hung-task backtrace printed after the GPU hang that occurred when replaying the apitrace captured on armv7l while playing the VP9 video with kodi.
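
For anyone comparing several such logs, the relevant lines can be picked out mechanically. The sketch below is illustrative only: the two regular expressions are assumptions about the kernel's hung-task report and the amdgpu ring-timeout message, whose exact wording varies between kernel versions, and `summarize_hang` is a hypothetical helper name.

```python
import re

# Assumed kernel log formats; the exact wording varies between kernel versions.
HUNG_TASK = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")
RING_TIMEOUT = re.compile(r"ring (\S+) timeout")


def summarize_hang(log_text):
    """Collect hung-task and GPU ring-timeout reports from kernel log text."""
    summary = {"hung_tasks": [], "ring_timeouts": []}
    for line in log_text.splitlines():
        hung = HUNG_TASK.search(line)
        if hung:
            # (task name, pid, seconds blocked)
            summary["hung_tasks"].append(
                (hung.group(1), int(hung.group(2)), int(hung.group(3)))
            )
            continue
        ring = RING_TIMEOUT.search(line)
        if ring:
            # Name of the ring that timed out, e.g. a graphics or video ring.
            summary["ring_timeouts"].append(ring.group(1))
    return summary
```

Running `summarize_hang` over the text of a saved dmesg file returns the blocked tasks and the rings that timed out, which makes it easier to see whether two lockups hit the same ring.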
Comment 5 Julien Isorce 2018-08-09 14:56:21 UTC
(In reply to Luis Mendes from comment #0)
> I can also provide the apitrace trace file, but it takes around 1GB of data.

Just provide it through Google Drive or a similar service, see https://bugs.freedesktop.org/show_bug.cgi?id=94900#c15
Comment 6 Luis Mendes 2018-08-09 15:40:44 UTC
(In reply to Julien Isorce from comment #5)
> (In reply to Luis Mendes from comment #0)
> > I can also provide the apitrace trace file, but it takes around 1GB of data.
> 
> Just provide it through google drive or other similar way, see
> https://bugs.freedesktop.org/show_bug.cgi?id=94900#c15

I haven't sent updates on this issue for a while, but the problem is now more diverse. On the amd64 platforms that I have (TYAN S7002, TYAN S7025), I have trouble getting the amdgpu driver to load, and when it does load, it runs into a GPU lockup as soon as it tries to enter the graphical X session. It has been like that for kernels linux-4.16.x, 4.17.x and 4.18-rcX.
Please see
https://lists.freedesktop.org/archives/amd-gfx/2018-July/023925.html

On armhf the story has been different. I was able to get a working configuration with Ubuntu 17.10, kernel 4.17.6 and kodi 17.3; however, the same kernel with Ubuntu 18.04 and kodi 17.6 made the problem reappear. I switched to kernel-4.18-rc8 and the problem went away again. I can provide an apitrace for 4.17.6 if desired, but it looks like it is fixed with kernel 4.18.

From my side, I am now more concerned with my amd64 platforms, as I am simply unable to use the AMD gpus.

Please advise.
Comment 7 GitLab Migration User 2019-09-25 18:02:17 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1297.
