Bug 82889

Summary: [drm:si_dpm_set_power_state] *ERROR* si_disable_ulv failed
Product: DRI
Reporter: mmstickman
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: alexandre.f.demers, alexdeucher
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (description: flags):
dmesg: none
Radeon dpm hang: none
Radeon dpm success: none
dmesg | grep drm: none
journalctl log: none
disable ulv state on SI: none
dmesg | grep drm: none

Description mmstickman 2014-08-21 04:03:33 UTC
Yet another bug I've encountered on my Radeon HD 7950 with kernel 3.16.
Comment 1 Michel Dänzer 2014-08-21 05:57:04 UTC
Please attach the dmesg output.
Comment 2 mmstickman 2014-08-21 05:59:46 UTC
Created attachment 105003 [details]
dmesg
Comment 3 Alex Deucher 2014-08-21 13:16:32 UTC
Is this a regression?  If so, can you bisect?
Comment 4 mmstickman 2014-08-21 18:03:00 UTC
I'm not sure if it is a regression or simply a new feature that doesn't work. I don't recall seeing the message in kernel 3.14 or prior. I don't have any experience in bisecting.
Comment 5 Samir Ibradžić 2014-08-30 08:19:26 UTC
Created attachment 105455 [details]
Radeon dpm hang
Comment 6 Samir Ibradžić 2014-08-30 08:20:10 UTC
Created attachment 105456 [details]
Radeon dpm success
Comment 7 Samir Ibradžić 2014-08-30 08:20:39 UTC
I see this happening on a Radeon HD 7950 with kernel 3.13, which has radeon dpm enabled by default. It is intermittent, occurs on roughly 30% of boots, and causes a hang followed by a reboot (no panic or oops messages). Unfortunately, I could only capture the dmesg output over serial: when the machine hangs, the kernel logs are not even saved to disk.

I see "[drm:si_dpm_set_power_state] *ERROR* si_disable_ulv failed" on serial only when the ignore_loglevel kernel parameter is unset. The machine hangs and reboots every time I spot it.
I have attached the dmesg output captured with the ignore_loglevel and drm.debug=1 parameters, for both the failing and the successful case, for comparison. When it fails, the machine hangs briefly and then reboots, right after the "[drm]    pitch is 7680" message.

With the radeon.dpm=0 parameter, this problem NEVER happens!
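
The kernel parameters mentioned in this comment (ignore_loglevel, drm.debug=1, radeon.dpm=0) can be checked on a running system by reading /proc/cmdline. A minimal Python sketch, assuming simple space-separated parameters without quoted values; the function names and the sample command line are illustrative and are not taken from the attached logs:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as ignore_loglevel map to True.
    Assumption: no quoted values containing spaces.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def dpm_disabled(params):
    """True only if radeon.dpm=0 was passed explicitly."""
    return params.get("radeon.dpm") == "0"


# Hypothetical command line, for illustration only.
sample = "ro quiet ignore_loglevel drm.debug=1 radeon.dpm=0"
params = parse_cmdline(sample)
print(params.get("ignore_loglevel"), params.get("drm.debug"), dpm_disabled(params))
```

On a real machine the input would come from open("/proc/cmdline").read() instead of the sample string.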
Comment 8 Lorenzo Bona 2014-09-25 19:11:11 UTC
Same warning here with 3.17-rc5, built from this repository:

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.17
Comment 9 Lorenzo Bona 2014-09-25 19:12:28 UTC
Created attachment 106873 [details]
dmesg | grep drm
Comment 10 Alexandre Demers 2014-09-26 00:42:56 UTC
Just moved from a 6950 (r600g) to a 7950 (radeonsi) and hit the same error on kernel 3.17-rc6.
Comment 11 Alexandre Demers 2014-09-26 00:49:00 UTC
Does ULV stand for ultra-low voltage? If so, isn't this option meant to apply to APUs only?
Comment 12 Alexandre Demers 2014-09-26 01:23:55 UTC
Created attachment 106884 [details]
journalctl log
Comment 13 Alexandre Demers 2014-09-26 04:42:24 UTC
I commented out every "return ret;" in si_dpm_set_power_state() in si_dpm.c. After booting this modified kernel, I can confirm this is the only error reported in si_dpm_set_power_state(): every other check passes and the function runs through to the very end.
Comment 14 Lorenzo Bona 2014-09-26 12:21:02 UTC
(In reply to comment #11)
> Does ULV stand for ultra-low voltage? If so, isn't this option meant to
> apply to APUs only?

Hmm, I don't know.
I haven't dug any further, but my GPU (R7-265) is running hotter than before, always around 39°-40° (even with no applications running).
With the Debian sid kernel, 3.16.x, I see low temperatures at idle.

I can't trigger this warning when booting with radeon.dpm=0, so I think it is related to power management.
Comment 15 Alexandre Demers 2014-09-26 15:59:13 UTC
(In reply to comment #14)
> (In reply to comment #11)
> > Does ULV stand for ultra-low voltage? If so, isn't this option meant to
> > apply to APUs only?
> 
> Hmm, I don't know.
> I haven't dug any further, but my GPU (R7-265) is running hotter than before,
> always around 39°-40° (even with no applications running).
> With the Debian sid kernel, 3.16.x, I see low temperatures at idle.
> 
> I can't trigger this warning when booting with radeon.dpm=0, so I think it
> is related to power management.

Well, according to my journalctl log, it always seems to be triggered after a power state switch. It doesn't happen on the switch from boot (power state 0) to performance (power state 1), but later. Strangely, I see:
Sep 25 20:28:28 Xander kernel: switching from power state:
Sep 25 20:28:28 Xander kernel:         ui class: performance
...
Sep 25 20:28:28 Xander kernel: switching to power state:
Sep 25 20:28:28 Xander kernel:         ui class: performance

Why would it try to switch from power state 1 to power state 1 (the same power state)? And why does the problem arise at that moment? I'll have to run more tests to see whether this happens every time.
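
Checking whether this pattern recurs can be automated by scanning the journal for "switching from" / "switching to" pairs that report the same ui class. A rough Python sketch, assuming the log keeps the layout of the excerpt above (a header line followed by a "ui class:" line); the function name is hypothetical, and matching ui classes only suggest, but do not prove, that both sides are the same power state:

```python
import re

UI_CLASS = re.compile(r"ui class:\s*(\S+)")


def find_same_class_switches(lines):
    """Return (from_class, to_class) pairs whose ui classes are identical."""
    pending = None   # which header was seen last: "from" or "to"
    classes = {}     # ui class collected for each side of the current switch
    hits = []
    for line in lines:
        if "switching from power state:" in line:
            pending, classes = "from", {}
        elif "switching to power state:" in line:
            pending = "to"
        elif pending:
            match = UI_CLASS.search(line)
            if match:
                classes[pending] = match.group(1)
                pending = None
                if "from" in classes and "to" in classes:
                    if classes["from"] == classes["to"]:
                        hits.append((classes["from"], classes["to"]))
                    classes = {}
    return hits


# Lines copied from the excerpt in this comment.
excerpt = [
    "Sep 25 20:28:28 Xander kernel: switching from power state:",
    "Sep 25 20:28:28 Xander kernel:         ui class: performance",
    "...",
    "Sep 25 20:28:28 Xander kernel: switching to power state:",
    "Sep 25 20:28:28 Xander kernel:         ui class: performance",
]
print(find_same_class_switches(excerpt))
```

Feeding it the full journalctl output, one line per list element, would show how often such same-class switches occur and whether they line up with the si_disable_ulv error.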
Comment 16 Alexandre Demers 2014-10-05 02:41:53 UTC
Alex, I think this "ERROR" should be at most a warning: I've been commenting out the "return ret" taken when we hit the error, and everything else runs smoothly.

Also, do you have any idea where we should dig to understand why we are hitting this error? As Samir said, this appeared with dpm.
Comment 17 Alex Deucher 2014-10-13 16:48:38 UTC
Created attachment 107784 [details] [review]
disable ulv state on SI

(In reply to Alexandre Demers from comment #16)
> Alex, I think this "ERROR" should be at most a warning: I've been commenting
> out the "return ret" taken when we hit the error, and everything else runs
> smoothly.
> 
> Also, do you have any idea where we should dig to understand why we are
> hitting this error? As Samir said, this appeared with dpm.

It's part of dpm, so it only happens when dpm is enabled. ULV is a special low-power state the card can enter in certain idle cases.

Does the attached patch help?
Comment 18 Alexandre Demers 2014-10-14 05:05:33 UTC
(In reply to Alex Deucher from comment #17)
> Created attachment 107784 [details] [review]
> disable ulv state on SI
> 
> (In reply to Alexandre Demers from comment #16)
> > Alex, I think this "ERROR" should be at most a warning: I've been commenting
> > out the "return ret" taken when we hit the error, and everything else runs
> > smoothly.
> > 
> > Also, do you have any idea where we should dig to understand why we are
> > hitting this error? As Samir said, this appeared with dpm.
> 
> It's part of dpm, so it only happens when dpm is enabled. ULV is a special
> low-power state the card can enter in certain idle cases.
> 
> Does the attached patch help?

I have changed cards yet again and am now running an R9 280X. I'll put the old card back in tomorrow to take a look.

So ULV is a feature available on both APUs and the 7950 (and some other GPUs). Good to know.

But is ULV support really supposed to be available on Tahiti? Even before your patch, drivers/gpu/drm/radeon/si_dpm.c contains the comment "/* XXX disable for A0 tahiti */", yet ulv.supported is set to true on the very next line (the one your patch changes). That reads like saying one thing and doing the opposite. Or does the comment mark a special case (Tahiti) that we should be handling but are not?
Comment 19 Alex Deucher 2014-10-14 12:59:46 UTC
(In reply to Alexandre Demers from comment #18)
> 
> But is ULV support really supposed to be available on Tahiti? Even before
> your patch, drivers/gpu/drm/radeon/si_dpm.c contains the comment "/* XXX
> disable for A0 tahiti */", yet ulv.supported is set to true on the very next
> line (the one your patch changes). That reads like saying one thing and
> doing the opposite. Or does the comment mark a special case (Tahiti) that we
> should be handling but are not?

A0 is first silicon (basically the initial silicon samples we get back from the fab during bring up).  The issue was fixed in later silicon revisions.  There usually aren't any A0 boards in the wild.
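
The interplay discussed in this thread between the ulv.supported flag, the si_disable_ulv error, and the "return ret" workaround can be illustrated with a toy model. The Python sketch below is not driver code: the class, its fields, and the error value are hypothetical, and it assumes (without confirmation from the patch itself) that a cleared flag makes the disable step a no-op:

```python
import errno
from dataclasses import dataclass, field


@dataclass
class ToyCard:
    """Hypothetical stand-in for the driver's power-management state."""
    ulv_supported: bool = True
    firmware_accepts_disable: bool = False  # simulate the request being rejected
    log: list = field(default_factory=list)


def disable_ulv(card):
    """Return 0 on success, a negative errno-style value on failure."""
    if not card.ulv_supported:
        return 0  # assumption: with the flag cleared there is nothing to disable
    return 0 if card.firmware_accepts_disable else -errno.EINVAL


def set_power_state(card, abort_on_error=True):
    """Model of a state switch whose first step is disabling ULV."""
    ret = disable_ulv(card)
    if ret:
        card.log.append("si_disable_ulv failed")
        if abort_on_error:
            return ret  # the early "return ret" discussed in comments 13 and 16
    card.log.append("remaining steps completed")
    return 0


strict = ToyCard()
print(set_power_state(strict), strict.log)

patched = ToyCard(ulv_supported=False)
print(set_power_state(patched), patched.log)
```

In this model, skipping the early return lets the remaining steps run despite the logged error, while clearing the flag avoids the error altogether.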
Comment 20 Alexandre Demers 2014-10-19 05:03:48 UTC
Sadly, I won't be able to test this patch: I had an opportunity to sell my HD 7950. We can keep this bug open in case someone else can test it.
Comment 21 Lorenzo Bona 2014-10-20 17:32:53 UTC
I rebuilt the whole stack today (mesa, ddx, drm, xorg, and kernel) from the latest commits.

The problem now appears to be solved. dmesg attached.
Comment 22 Lorenzo Bona 2014-10-20 17:33:01 UTC
Created attachment 108125 [details]
dmesg | grep drm
Comment 23 sean darcy 2014-10-31 18:27:49 UTC
I have a Kaveri A8-7100. After this patch, is there a way to re-enable ULV without rebuilding drm?
Comment 24 Alex Deucher 2014-10-31 19:06:24 UTC
(In reply to sean darcy from comment #23)
> I have a Kaveri A8-7100. After this patch, is there a way to re-enable ULV
> without rebuilding drm?

This patch and bug have nothing to do with Kaveri. They relate specifically to Southern Islands GPUs.
Comment 25 Martin Peres 2019-11-19 08:54:52 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/518.
