Bug 109135

Summary: R9 390 hangs at boot with DPM/DC enabled for kernels 4.19.x and above, says KMS not supported
Product: DRI Reporter: rmuncrief
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: critical    
Priority: medium CC: gta4lj
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments (description, flags):
Example of failed amdgpu boot with R9 390. (flags: none)
Example of good amdgpu boot with R9 390. (flags: none)
Kernel Log From Crash on R9 390 > 4.19 (flags: none)

Description rmuncrief 2018-12-22 19:20:23 UTC
Created attachment 142874 [details]
Example of failed amdgpu boot with R9 390.

I'm running the latest Manjaro and have a Sapphire Nitro R9 390. Kernels before 4.19.x work great with amdgpu, but all kernels from 4.19.x on hang at boot, and usually don't even generate a boot log, no matter how long you wait.

However, after compiling and testing the latest linux-amd-wip-git I was able to get a log, so I can finally file a bug. By the way, I've gotten a few logs before but lost them while debugging; this time I had the sense to save it.

In any case, the error line is always the same:

AMDGPU(0): [KMS] drm report modesetting isn't supported.

This only happens if you set amdgpu.dpm=1 and/or amdgpu.dc=1 with the bad kernels. If you leave those options out amdgpu works, but things like resuming from suspend, etc. don't work.

I've attached Xorg.0.good.log to show a good boot with kernel 4.18.20, and a Xorg.0.bad.log to show the failed boot with amd-wip-git. And by the way I've tried numerous kernels over the last weeks and they all fail the same way, but if you want me to test a specific kernel I'll be happy to do so.

Here's the grub line I use for all testing:
GRUB_CMDLINE_LINUX_DEFAULT="quiet resume=UUID=b4f71480-8fe3-43c2-99c9-fc3f5687545b libata.atapi_passthru16=0 rd.modules-load=vfio-pci amd_iommu=on iommu=pt amdgpu.modeset=1 radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1"
Comment 1 rmuncrief 2018-12-22 19:21:19 UTC
Created attachment 142875 [details]
Example of good amdgpu boot with R9 390.
Comment 2 rmuncrief 2018-12-23 19:52:00 UTC
Oops, I accidentally overwrote the log from linux-amd-wip-git with a later failed test with 4.20.0-rc7-mainline, so the bad log I attached is for 4.20.0-rc7-mainline.

However, as I said before, it's been the same error all along for all kernels from 4.19.x on. I'll recompile linux-amd-wip-git and get a log from there if anyone needs it though.
Comment 3 toma678 2018-12-24 21:43:36 UTC
Created attachment 142880 [details]
Kernel Log From Crash on R9 390 > 4.19

Added kernel log from crash. Error affects 4.19 and 4.20. Runs fine with same settings on 4.18. 

Line #925 looks like first error, with further errors on #943 and #944
Comment 4 Michel Dänzer 2018-12-27 16:15:55 UTC
Looks VCE related:

Dec 24 21:12:11.223675 tom-comp kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
Dec 24 21:12:11.223708 tom-comp kernel: [drm:amdgpu_device_init.cold.17 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110

If you can live without hardware accelerated video encoding, amdgpu.ip_block_mask=0xff on the kernel command line might serve as a workaround.
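
To illustrate how such a mask is read, here is a minimal Python sketch. It assumes the mask is a per-block enable bitmap (bit i clear means IP block i is skipped) and that vce_v2_0 is the ninth block (index 8) on this ASIC. The block list is an assumption for illustration; only vce_v2_0 is named in the log above.

```python
# Assumed IP block order for this ASIC (hypothetical; only vce_v2_0 comes from the log).
ASSUMED_IP_BLOCKS = [
    "common", "gmc", "ih", "smu", "dce", "gfx", "sdma", "uvd", "vce_v2_0",
]


def enabled_blocks(mask, blocks=ASSUMED_IP_BLOCKS):
    """Return the names of blocks whose index bit is set in the mask."""
    return [name for i, name in enumerate(blocks) if mask & (1 << i)]


def disabled_blocks(mask, blocks=ASSUMED_IP_BLOCKS):
    """Return the names of blocks whose index bit is clear in the mask."""
    return [name for i, name in enumerate(blocks) if not mask & (1 << i)]


if __name__ == "__main__":
    # 0xff sets bits 0..7 only, so index 8 is the one block left out.
    print("disabled:", disabled_blocks(0xFF))
```

Under those assumptions, 0xff leaves every block enabled except the one at index 8.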
Comment 5 i.kalvachev 2019-01-10 23:47:05 UTC
(In reply to rmuncrief from comment #2)
[...]
> Here's the grub line I use for all testing:
> GRUB_CMDLINE_LINUX_DEFAULT="quiet
> resume=UUID=b4f71480-8fe3-43c2-99c9-fc3f5687545b libata.atapi_passthru16=0
> rd.modules-load=vfio-pci amd_iommu=on iommu=pt amdgpu.modeset=1
[...]

Would you try with "iommu=soft", aka disable the hardware one?
Maybe try turning it off entirely.
Comment 6 rmuncrief 2019-01-11 00:54:52 UTC
(In reply to iive from comment #5)
> (In reply to rmuncrief from comment #2)
> [...]
> > Here's the grub line I use for all testing:
> > GRUB_CMDLINE_LINUX_DEFAULT="quiet
> > resume=UUID=b4f71480-8fe3-43c2-99c9-fc3f5687545b libata.atapi_passthru16=0
> > rd.modules-load=vfio-pci amd_iommu=on iommu=pt amdgpu.modeset=1
> [...]
> 
> Would you try with "iommu=soft", aka disable the hardware one?
> Maybe try turning it off entirely.

Thank you for looking into this bug. I tried all combinations of amd_iommu/iommu "soft" and "off", on kernels 4.19.13, 4.20.0, and linux-mainline from last night. And I tried them all with the BIOS iommu enabled and disabled, and also leaving and removing iommu=pt.

In all cases the boot failed in the same way, with a black screen and complete lockup so that even ssh from another terminal would not work. And unfortunately none of them generated any type of boot log that I could find. In fact if you delete all the /var/log/Xorg.* log files, and then try booting with the bad kernels, not even an empty Xorg.*.old log file is generated. The only log that appears after I finally change back to 4.18.20 is the single Xorg.0.log from its boot. And of course there's nothing at all in journalctl.

In any case I'm an embedded systems designer, but only have a cursory knowledge of low level Linux kernel development. However I'm willing to invest whatever time is necessary to help find this bug. I bought an expensive R9 390 three years ago and have only been able to fully utilize it for about six months off and on.

By the way, I'd have no problem sticking with the 4.18 kernel as everything works great with it, and I've heard that same sentiment from others. It really seems that for whatever reasons there were systemic errors introduced after it, and they will take a substantial amount of time to fix. Our only concern is that 4.18 will soon lose support. So given the unprecedented number of problems with all later kernels it may be wise to move it to LTS, and give developers more time to work on the plethora of problems they're dealing with now.
Comment 7 Alex Deucher 2019-01-11 01:09:14 UTC
Can you bisect to figure out what commit broke things for you?
Comment 8 rmuncrief 2019-01-11 01:25:24 UTC
(In reply to Alex Deucher from comment #7)
> Can you bisect to figure out what commit broke things for you?

Actually I remember doing that many years ago when I was a maintainer for Steam under wine. I'll look and see if I can find a current bisect tutorial and give it a try. Any links or tips you can give to help give me a quick start would be appreciated. I do remember it can take many days, which I'm willing to invest as I said. However the fewer days the better! :)
Comment 9 Alex Deucher 2019-01-11 15:30:34 UTC
(In reply to rmuncrief from comment #8)
> (In reply to Alex Deucher from comment #7)
> > Can you bisect to figure out what commit broke things for you?
> 
> Actually I remember doing that many years ago when I was a maintainer for
> Steam under wine. I'll look and see if I can find a current bisect tutorial
> and give it a try. Any links or tips you can give to help give me a quick
> start would be appreciated. I do remember it can take many days, which I'm
> willing to invest as I said. However the fewer days the better! :)

It's pretty straightforward.  Just google for "kernel git bisect howto".
Comment 10 i.kalvachev 2019-01-11 16:29:34 UTC
(In reply to Alex Deucher from comment #9)
> [...]
> It's pretty straightforward.  Just google for "kernel git bisect howto".

Bisecting between two major stable kernel versions (e.g. 4.18.0 to 4.19.0) is a nightmare.

Most of the new changes are done before RC1 and it is quite common that there are major breakages there, in systems we do not want to bother with. These breakages are usually fixed (or reverted) in later Release Candidates.

It may be better to use some of the drm-next repositories, assuming that they hold all graphic changes that are going upstream, but are not rebased until the final release is done.
Comment 11 rmuncrief 2019-01-11 20:10:15 UTC
(In reply to iive from comment #10)
> [...]
> Bisecting between two major stable kernel versions is a nightmare.(Aka
> 4.18.0 - 4.19.0)
> 
> Most of the new changes are done before RC1 and it is quite common that
> there are major breakages there, in systems we do not want to bother with.
> These breakages are usually fixed (or reverted) in later Release Candidates.
> 
> It may be better to use some of the drm-next repositories, assuming that
> they hold all graphic changes that are going upstream, but are not rebased
> until the final release is done.

Yes, that's the kind of stuff I was worried about. I understand bisect is simple in theory, but in practice there are usually a lot of problems.

And I already ran into my first and am redoing everything from scratch again. I was able to cobble together a working PKGBUILD and tested it by compiling 4.18.20 from stable-branch git without any of the ARCH or Manjaro patches. I did some quick tests with my compiled 4.18.20 to make sure it was working, and then built 4.19.14 and made sure that one didn't work.

But when I went to do the first bisect it complained about uncommitted files and aborted. I tried various things and eventually discovered that I probably should have just executed "git stash", but by that time I couldn't be sure of the integrity of the build so I just started all over again this morning.

However now I'm more confused because it seems you might be saying I should be using a 4.19 release candidate instead of a stable build? Like I said, I'm willing to put in substantial effort to help fix this problem, but I'd like to waste the least amount of time possible. I also understand there are various ways to speed things up by disabling unused kernel features, caching, etc. but I'm from the old school of engineering and am hesitant to modify anything I'm testing unless absolutely necessary. After all, since it's a completely unknown problem there's no way to know what's causing it. It could very well be something completely unexpected that has to do with components one wouldn't normally consider.

And by the way, by "old school" I mean I started out as a hardware/firmware designer when the most advanced processors were 8 bits running at 2 MHz! Yes, that's right, I'm freakin' old! :)

In any case at this point if someone could tell me whether I should be targeting a stable release or release candidate it would be helpful.
Comment 12 Alex Deucher 2019-01-11 20:38:47 UTC
(In reply to iive from comment #10)
> [...]
> Bisecting between two major stable kernel versions is a nightmare.(Aka
> 4.18.0 - 4.19.0)
> 
> Most of the new changes are done before RC1 and it is quite common that
> there are major breakages there, in systems we do not want to bother with.
> These breakages are usually fixed (or reverted) in later Release Candidates.

This is not always the case; in most cases bisects are pretty smooth.  If you run into unrelated problems with a particular commit, you can always skip it during the bisect (git bisect skip).
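
For a rough idea of the effort involved, here is a minimal Python sketch of the binary search that bisecting performs over a linear history, including skipping untestable commits. It is a simplified model, not git's actual algorithm; the commit list and the `test` callback are hypothetical.

```python
def bisect_first_bad(commits, test):
    """Find the first bad commit in a linear history.

    commits: list ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    test:    callable returning "good", "bad", or "skip" for a commit.
    Returns (candidates, steps). candidates holds every commit that could be
    the first bad one; it has more than one entry only when skipped commits
    sit directly before the first tested-bad commit.
    """
    good, bad = 0, len(commits) - 1  # invariant: commits[good] good, commits[bad] bad
    skipped = set()
    steps = 0
    while True:
        # Commits strictly between the two ends that have not been skipped.
        untested = [i for i in range(good + 1, bad) if i not in skipped]
        if not untested:
            break
        mid = (good + bad) // 2
        # Test the untested commit closest to the midpoint.
        pick = min(untested, key=lambda i: abs(i - mid))
        steps += 1
        verdict = test(commits[pick])
        if verdict == "good":
            good = pick
        elif verdict == "bad":
            bad = pick
        else:
            skipped.add(pick)
    return commits[good + 1:bad + 1], steps


if __name__ == "__main__":
    history = list(range(100))  # hypothetical commit ids, oldest first
    found, builds = bisect_first_bad(history, lambda c: "bad" if c >= 37 else "good")
    print(found, builds)
```

Without skips the number of builds grows with the base-2 logarithm of the number of commits in the range, which is why even a large range between two releases needs only a modest number of test boots.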
Comment 13 rmuncrief 2019-01-11 20:44:41 UTC
(In reply to Alex Deucher from comment #12)
> [...]
> This is not always the case; in most cases bisects are pretty smooth.  If
> you run into unrelated problems with a particular commit, you can always
> skip it during the bisect (git bisect skip).

Okay. I'm recompiling the first bisect now; there were no problems starting it this time after using "git stash." However, I didn't issue "make clean" because it's unclear if it's necessary and I'm hoping I don't have to recompile everything for each iteration. Is it necessary or not? The instructions are vague and simply say you "might" have to. How would one know if they should? Does the compile fail if you need to, or can silent problems be introduced if you don't?
Comment 14 Alex Deucher 2019-01-11 20:49:22 UTC
On Fri, Jan 11, 2019 at 3:44 PM <bugzilla-daemon@freedesktop.org> wrote:
>
> Okay. I'm recompiling the first bisect now, there were no problems starting it
> this time after using "git stash." However I didn't issue "make clean" because
> it's unclear if it's necessary and I'm hoping I don't have to recompile
> everything for each iteration. Is it necessary or not? The instructions are
> vague and simply say you "might" have to. How would one know if they should?
> Does the compile fail if you need to, or can silent problems be introduced if
> you don't?
>
Generally it's not required to make clean every time unless you run into a build-related problem.
Comment 15 rmuncrief 2019-01-12 21:46:59 UTC
(In reply to Alex Deucher from comment #7)
> Can you bisect to figure out what commit broke things for you?

Okay, here is the result from bisect:

b92c628712ed3a1cf5d4a144290e8ffc170bf51e is the first bad commit
commit b92c628712ed3a1cf5d4a144290e8ffc170bf51e
Author: Rex Zhu <Rex.Zhu@amd.com>
Date:   Tue Jun 5 13:06:11 2018 +0800

    drm/amd/pp: Unify powergate_uvd/vce/mmhub to set_powergating_by_smu
    
    Some HW ip blocks need call SMU to enter/leave power gate state.
    So export common set_powergating_by_smu interface.
    
    1. keep consistent with set_clockgating_by_smu
    2. scales easily to powergate other ip(gfx) if necessary
    
    Reviewed-by: Evan Quan <evan.quan@amd.com>
    Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 cf80741fafed463d93a0d41d85f267ff8c3859ca 2daca62d7af472048be8e0911c39e9a26fe376c0 M	drivers


Let me know if there's anything more I can do to help. I'll take a look at the code but like I said I'm not a kernel developer so it's unlikely I'll understand it or be able to fix it.
Comment 16 Alex Deucher 2019-01-14 15:45:43 UTC
Does removing amdgpu.dpm=1 from the kernel command line fix the issue?
Comment 17 rmuncrief 2019-01-14 17:53:09 UTC
Yes, as I said in the beginning, if you don't set amdgpu.dpm=1 the system boots but resume doesn't work. I presume it's because power management isn't working. All kernels before 4.19.x work great, and have no problem with resume so long as dpm is enabled.
Comment 18 Alex Deucher 2019-01-14 19:19:25 UTC
(In reply to rmuncrief from comment #17)
> Yes, As I said in the beginning if you don't set amdgpu.dpm=1 the system
> boots but resume doesn't work. I presume it's because power management isn't
> working. All kernels before 4.19.x work great, and have no problem with
> resume so long as dpm is enabled.

Dynamic power management is always enabled by default.  The only way to disable it is to set dpm=0.  Setting dpm=1 either does nothing or changes the implementation used in the driver, depending on the ASIC.  You shouldn't set dpm= unless you are debugging something.  For your ASIC, on kernels 4.18 and older the default was the original dpm implementation, and setting dpm=1 selected the newer powerplay code.  On kernel 4.19 and newer, the default switched to the newer powerplay code, and setting dpm=1 switches back to the older implementation.  So removing dpm=1 is correct for 4.19 and newer.  Can you elaborate on the resume issue?
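
The selection rules in this comment can be summarised in a small Python sketch. It encodes only what is stated here for this ASIC; the function name, the use of `None` for "option not set", and the returned labels are illustrative and are not driver identifiers.

```python
def dpm_implementation(kernel, dpm=None):
    """Return which power management code runs on this ASIC.

    kernel: (major, minor) tuple, e.g. (4, 19).
    dpm:    None if amdgpu.dpm is absent from the command line, else 0 or 1.
    """
    if dpm not in (None, 0, 1):
        raise ValueError("this sketch only models dpm unset, 0, or 1")
    if dpm == 0:
        return "disabled"
    newer_kernel = kernel >= (4, 19)
    if dpm is None:
        # Default implementation for the kernel series.
        return "powerplay" if newer_kernel else "legacy dpm"
    # dpm=1 selects the non-default implementation on this ASIC.
    return "legacy dpm" if newer_kernel else "powerplay"


if __name__ == "__main__":
    for kernel in [(4, 18), (4, 19)]:
        for dpm in [None, 1]:
            print(kernel, dpm, dpm_implementation(kernel, dpm))
```

Read this way, a command line that carried dpm=1 across the 4.18 to 4.19 upgrade ends up on a different implementation than before, even though the option itself did not change.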
Comment 19 rmuncrief 2019-01-14 20:04:40 UTC
Oh, I see. I wasn't aware of the change. I wondered in the beginning if things were different because I knew the explicit dc and dpm settings were a temporary workaround, but unfortunately I didn't come across any explicit notices about whether or not they were still used.

In any case, I'd tried both setting amdgpu.dpm=0 and omitting it completely, and that's when I discovered I could boot without it (if I remember correctly it wouldn't boot with dpm=0 either). However, the problem is that my computer doesn't resume from suspend with any kernels from 4.19.x on, so I assumed dynamic power management wasn't enabled. But if I understand you correctly, it seems it's enabled but not working with resume for some reason.

And by the way, thank you for working on this problem so diligently with me. I've had a lot of problems with my beloved R9 390 and it's nice to see it getting the attention it deserves! :)
Comment 20 i.kalvachev 2019-01-16 14:27:14 UTC
(In reply to Alex Deucher from comment #12)
> (In reply to iive from comment #10)
[...]
> > Most of the new changes are done before RC1 and it is quite common that
> > there are major breakages there, in systems we do not want to bother with.
> > These breakages are usually fixed (or reverted) in later Release Candidates.
> 
> This is not always the case; in most cases bisects are pretty smooth.  If
> you run into unrelated problems with a particular commit, you can always
> skip it during the bisect (git bisect skip).

Let's say that they merge a change that breaks booting on my motherboard and then they merge the radeon repo. Then they fix booting at rc2.

I will have to `git bisect skip` all commits between the boot-breaking merge and rc2, as they all will produce broken kernels. Since all/most suspected radeon commits are in this range, you can't bisect them.

One trick to avoid such problems is to do a mass revert: start from the final release (e.g. 4.19.0), revert all commits in a given subsystem (e.g. amdgpu) back to the previous release, and then bisect within the reverted commits.
If there are no interlinked changes, this method mostly works. But even small cosmetic changes or simple API modification outside the subsystem can complicate things.

Anyway, bisect was done successfully.

I just still kind of don't understand why it landed on that commit.
It doesn't look like this is the commit that changes the default amdgpu.dpm method.

Also, why not use dpm=1 for DPM and dpm=2 for PowerPlay?
Comment 21 Alex Deucher 2019-01-16 16:45:30 UTC
(In reply to iive from comment #20)
> Anyway, bisect was done successfully.
> 
> I just still kind of don't understand why it landed on that commit.
> It doesn't look like this is the commit that changes the default amdgpu.dpm
> method.

The default change may have happened before that change and that change may have broken the old dpm code since presumably dpm=1 was set while bisecting.
Comment 22 i.kalvachev 2019-01-16 17:57:17 UTC
(In reply to Alex Deucher from comment #21)
> (In reply to iive from comment #20)
> > Anyway, bisect was done successfully.
> > 
> > I just still kind of don't understand why it landed on that commit.
> > It doesn't look like this is the commit that changes the default amdgpu.dpm
> > method.
> 
> The default change may have happened before that change and that change may
> have broken the old dpm code since presumably dpm=1 was set while bisecting.

You haven't looked at the commit.

It is code refactoring. It doesn't remove, add or modify any functionality. It just changes how some functions are called. (1 function pointer and switch/case, instead of 3 function pointers.)
I honestly could not spot what might be wrong with it.
Comment 23 Alex Deucher 2019-01-16 18:58:15 UTC
(In reply to iive from comment #22)
> 
> It is code refactoring. It doesn't remove, add or modify any functionality.
> It just changes how some functions are called. (1 function pointer and
> switch/case, instead of 3 function pointers.)
> I honestly could not spot what might be wrong with it.

this changed:
@@ -1751,10 +1751,10 @@ void amdgpu_dpm_enable_uvd(struct amdgpu_device *adev, bool enable)
 
 void amdgpu_dpm_enable_vce(struct amdgpu_device *adev, bool enable)
 {
-       if (adev->powerplay.pp_funcs->powergate_vce) {
+       if (adev->powerplay.pp_funcs->set_powergating_by_smu) {

CI asic never had a powergate_vce callback before, so that code was never called for vce previously, at least for the old dpm implementation.  For the new one, it actually had a callback for vce powergating, but perhaps there was a bug in that code around the time this code was changed.
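
A simplified Python model of how that one-line change can alter behaviour. It assumes the old dpm implementation provided a UVD powergating callback but none for VCE, and that after the refactor it exposes the single unified callback; the dictionaries and helper names are hypothetical stand-ins for the pp_funcs table, not driver code.

```python
# Hypothetical callback tables for the old dpm implementation on a CI asic.
LEGACY_BEFORE = {
    "powergate_uvd": lambda enable: ("uvd", enable),  # no powergate_vce entry
}
LEGACY_AFTER = {
    "set_powergating_by_smu": lambda block, enable: (block, enable),
}


def enable_vce_before(pp_funcs, enable):
    """Old check: act only if a VCE-specific callback exists."""
    callback = pp_funcs.get("powergate_vce")
    if callback is None:
        return None  # VCE branch is skipped entirely
    return callback(enable)


def enable_vce_after(pp_funcs, enable):
    """New check: the unified callback exists as soon as any block uses it."""
    callback = pp_funcs.get("set_powergating_by_smu")
    if callback is None:
        return None
    return callback("vce", enable)  # VCE branch now runs


if __name__ == "__main__":
    print(enable_vce_before(LEGACY_BEFORE, True))
    print(enable_vce_after(LEGACY_AFTER, True))
```

In this model the same call that used to be a no-op for VCE starts reaching the callback once the presence test is made against the shared entry.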
Comment 24 Martin Peres 2019-11-19 09:08:59 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/653.
