Bug 99907

Summary: linux-firmware 2017-02-17 update causes varying breaks in AMDGPU for recent cards
Product: DRI
Reporter: saunders.52
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: sasy360
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (description; flags):
4.9.11 kernel log with old functioning firmware; none
4.9.11 kernel log with new malfunctioning firmware; none
Xorg 1.19.1 log with new malfunctioning firmware; none
Xorg 1.19.1 log with old functional firmware; none
journal log; none

Description saunders.52 2017-02-22 18:07:11 UTC
Since Arch Linux shipped an update to the linux-firmware package (20170217.12987ca-1), there have been reports of various issues, ranging from power management failing to, in my case (AMD Radeon RX 460), Xorg failing to work at all: it either blinks and drops back to a frozen VT as the GPU hangs, or the GPU hangs on a full-screen corruption of some kind. This is broken on both kernel 4.9.11 and 4.10 in my testing, on Xorg 1.19.1. The system still responds to SSH connections, but fails to shut down properly when the shutdown is attempted over SSH. Tracker link: https://bugs.archlinux.org/task/53042

I've traced it back to a specific commit to linux-firmware, 7a110b85a46d7f884f4ac712ff52e02ed57234bd, https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/amdgpu/carrizo_ce.bin?id=7a110b85a46d7f884f4ac712ff52e02ed57234bd, pushed to the git repo on 2017-02-17, which updates a large subset of the firmware images used by AMDGPU. Since this is a set of binary files, I'm really not sure how to test it further to provide any more useful information.

Apologies if this is the wrong place to report a firmware issue, but I was unsure where to file it otherwise.
Comment 1 saunders.52 2017-02-22 18:10:32 UTC
There's a dmesg set on the Arch bug tracker provided for the power management failure case - 
Previous firmware revision: https://bugs.archlinux.org/task/53042?getfile=15004
Current firmware revision: https://bugs.archlinux.org/task/53042?getfile=15005

Looking at my Xorg logs over SSH at the time, there were no differences from a successful usage of Xorg on the previous firmware. I wasn't thoughtful enough to take a dmesg capture, and I've got a large workload running on my machine right now. I can probably experiment with getting the logs for my case in a day or two.
Comment 2 Alex Deucher 2017-02-22 18:33:09 UTC
Does the firmware here fix the issue?

https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/
Comment 3 saunders.52 2017-02-22 20:37:12 UTC
I tried the firmware you linked and the problems persisted (GPU hang when starting Xorg). Since the machine still responds over SSH, I took the opportunity to capture my Xorg and kernel logs, which I will attach. For the record, the symptoms are the same with AMDGPU with my standard config (DRI 3, TearFree), with a blank config file, and with Modesetting.
Comment 4 saunders.52 2017-02-22 20:37:45 UTC
Created attachment 129843 [details]
4.9.11 kernel log with old functioning firmware.
Comment 5 saunders.52 2017-02-22 20:38:20 UTC
Created attachment 129844 [details]
4.9.11 kernel log with new malfunctioning firmware.
Comment 6 saunders.52 2017-02-22 20:38:49 UTC
Created attachment 129845 [details]
Xorg 1.19.1 log with new malfunctioning firmware.
Comment 7 saunders.52 2017-02-22 20:39:09 UTC
Created attachment 129846 [details]
Xorg 1.19.1 log with old functional firmware.
Comment 8 Sasan 2017-02-23 01:46:32 UTC
RX 460 user here. Same issue. Kernel panic and backtrace messages in my log file might help.
Comment 9 Sasan 2017-02-23 01:47:34 UTC
Created attachment 129852 [details]
journal log
Comment 10 Alex Deucher 2017-02-23 02:10:12 UTC
I've reverted the polaris 11 changes in the firmware git tree.  Just waiting for them to land.
Comment 11 saunders.52 2017-02-23 02:30:28 UTC
Okay, I'm glad to hear that. There are still people on Polaris10 and other cards having power management failures - someone's card doubled in idle temperature.
Comment 12 saunders.52 2017-02-23 08:33:02 UTC
Arch has a testing update (linux-firmware-20170217.12987ca-2) that's the same git revision that was causing problems, but with the troublesome AMD commits reverted, and this has fixed both my RX 460 GPU hang and the issues with power management on an R9 Fury.
Comment 13 Alex Deucher 2017-02-23 16:27:59 UTC
Does the new firmware work properly with kernel 4.10 or newer?
Comment 14 saunders.52 2017-02-23 16:29:42 UTC
Which new firmware? The one you linked earlier in the discussion, or the new setup with the one git commit reverted?
Comment 15 Alex Deucher 2017-02-23 16:30:47 UTC
(In reply to saunders.52 from comment #14)
> Which new firmware? The one you linked earlier in the discussion, or the new
> setup with the one git commit reverted?

Either or both.
Comment 16 saunders.52 2017-02-23 16:32:32 UTC
(In reply to Alex Deucher from comment #15)
> (In reply to saunders.52 from comment #14)
> > Which new firmware? The one you linked earlier in the discussion, or the new
> > setup with the one git commit reverted?
> 
> Either or both.

The git commit reversion should be similar enough to a manual change I tried with both 4.9 and 4.10 (trying the old firmware manually) that I can almost certainly say it would work. I haven't tried the other, and won't be able to for about 5 hours (away from the desktop in question).
Comment 17 Alex Deucher 2017-02-23 16:34:32 UTC
(In reply to saunders.52 from comment #16)
> 
> The GIT commit reversion should be similar enough to a manual change I tried
> with both 4.9 and 4.10 I can almost certainly say it would work (trying the
> old firmware manually). I haven't tried the other, and won't be able to for
> about 5 hours (away from the desktop in question).

So the new firmware works in 4.10, but not in 4.9?
Comment 18 saunders.52 2017-02-23 16:35:12 UTC
(In reply to Alex Deucher from comment #17)
> (In reply to saunders.52 from comment #16)
> > 
> > The GIT commit reversion should be similar enough to a manual change I tried
> > with both 4.9 and 4.10 I can almost certainly say it would work (trying the
> > old firmware manually). I haven't tried the other, and won't be able to for
> > about 5 hours (away from the desktop in question).
> 
> So the new firmware works in 4.10, but not in 4.9?

The old firmware works in 4.10. The new firmware hasn't been tested by me outside of 4.9.
Comment 19 saunders.52 2017-02-23 16:37:44 UTC
Well, the one you linked above didn't work in 4.9. The one shipping in the repos that is getting reverted (20170217.12987ca-1) didn't work in 4.9 and 4.10. The oldest of the three (the one shipped originally as part of 20161222.4b9559f) is stable in both.

Are there some version numbers I can refer to these by to make this less insanely confusing?
Comment 20 saunders.52 2017-02-23 16:38:05 UTC
And I didn't check the one you linked in 4.10. I think.
Comment 21 Alex Deucher 2017-02-23 16:43:13 UTC
(In reply to saunders.52 from comment #19)
> Are there some version numbers I can refer to these by to make this less
> insanely confusing?

The 5th dword in each binary is the version.
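
For anyone who would rather not read this off a hexdump by eye, here is a minimal sketch of extracting that value. It assumes a dword is a 32-bit little-endian integer, so the fifth dword sits at byte offset 16; the function names are made up for illustration and are not part of any AMD or kernel tooling.

```python
import struct


def firmware_version(blob: bytes) -> int:
    """Return the fifth 32-bit dword of a firmware image.

    Assumption: dwords are little-endian, so the fifth one
    occupies bytes 16..19 of the file.
    """
    if len(blob) < 20:
        raise ValueError("image too short to contain a fifth dword")
    (version,) = struct.unpack_from("<I", blob, 16)
    return version


def firmware_version_from_file(path: str) -> int:
    """Read only the first 20 bytes of the file and decode the version."""
    with open(path, "rb") as f:
        return firmware_version(f.read(20))
```

Calling `hex(firmware_version_from_file(path))` on each image gives a number to quote per file; the path depends on where the distribution installs linux-firmware. Under the same little-endian assumption, a value that the default `hexdump` output shows as `0080 0000` would decode to 0x80.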
Comment 22 saunders.52 2017-02-23 16:52:00 UTC
Okay, assuming I'm reading this right with hexdump... On an RX 460 (4 GB):

Old Committed Version (0080 0000): Works on 4.9 and 4.10.
New Committed Version, Now Reverted (0083 0000): Does not work on 4.9 or 4.10.
Download Version (0086 0000): Tested on 4.9, where it doesn't work. Probably not tested on 4.10 (I don't remember.)
Comment 23 saunders.52 2017-02-23 18:38:55 UTC
I was able to get back to the machine in question sooner than I thought.
The version you have for download in Comment 2 (0086 0000) does not work on 4.10 either, and has the same crash issue.
Comment 24 Martin Peres 2019-11-19 08:14:23 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/142.
