Bug 92936

Summary: Tonga powerplay issues
Product: DRI
Reporter: Andy Furniss <adf.lists>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED WORKSFORME
QA Contact:
Severity: normal
Priority: medium
CC: alexander, edward.ocallaghan
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments:
Description                    Flags
dmesg                          none
a couple of locks using uvd.   none
possible fix                   none
vce hung task                  none
uvd hung task                  none

Description Andy Furniss 2015-11-13 10:50:55 UTC
Created attachment 119627 [details]
dmesg

Testing the amdgpu-powerplay branch I have several issues.

The first, and possibly the cause of the others (so for now I won't file separate bugs), is that on boot I see -

Can't find requested voltage id in vdd_dep_on_sclk table!

Testing Games/demos all is good and perf is as expected.

Testing UVD however and perf is way worse than before powerplay.

I see in dmesg -

[ 1818.011402] Failed to send Message.
[ 1818.011407] Failed to send Previous Message.

After doing this I also may no longer get good perf from games/demos.

Trying to force high for the uvd test doesn't help, and I notice another issue -

/sys/class/drm/card0/device/power_dpm_force_performance_level

is auto to start with. If I echo high > it, that works = the GPU gets hotter, but I can't change it back. If I echo auto > ... and then cat it, it will still say high and the GPU remains hot.
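For reference, the sequence above can be sketched as a few shell commands. The card0 path is an assumption for this machine, and DPM_FILE is made overridable so the snippet can be dry-run against an ordinary file:

```shell
# Sketch of the force/auto sequence described above; the card0 path is an
# assumption, and DPM_FILE is overridable for a dry run without amdgpu.
DPM_FILE="${DPM_FILE:-/sys/class/drm/card0/device/power_dpm_force_performance_level}"

cat "$DPM_FILE"            # "auto" on a fresh boot
echo high > "$DPM_FILE"    # forcing high works: clocks rise, GPU gets hotter
echo auto > "$DPM_FILE"    # attempt to switch back
cat "$DPM_FILE"            # on the affected branch this still reads "high"
```

On a working kernel the final cat should read "auto" again; the bug is that it stays at "high".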

Separately when I quit X I may get some new errors in dmesg

Nov 12 11:36:50 ph4 kernel: Failed to send Previous Message.
Nov 12 11:36:50 ph4 kernel: last message repeated 2 times
Nov 12 11:36:50 ph4 kernel: unforce pcie level failed!
Comment 1 Alex Deucher 2015-11-13 15:53:30 UTC
Can you try updating to the latest powerplay branch?  Specifically this patch:
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=amdgpu-powerplay&id=7dd7b21debf064ecabd9ae09beb8a85ef492c46d
Comment 2 Andy Furniss 2015-11-13 20:10:30 UTC
This does boost uvd perf above the historical level, though not as much as I expected.

The "Failed to send Message." error is also gone.

"Can't find requested voltage id in vdd_dep_on_sclk table!" is still present in dmesg.

The use of UVD still regresses OpenGL perf.

It seems that when I am in this state, power_dpm_force_performance_level does not accept input. It is still "untouched" = auto, and now if I try high it won't take it. In dmesg I see -

[ 2761.751646] Failed to send Message.
[ 2761.751649] force highest pcie dpm state failed!
Comment 3 Andy Furniss 2015-11-15 14:29:05 UTC
I found today by luck that this issue does not exist if there are 2 displays alive when I boot. The second display is an HDMI TV; as it's not me using it, I --off it with xrandr after startx - but I can't then get UVD usage to break powerplay in this scenario.
Comment 4 Andy Furniss 2015-12-10 12:06:18 UTC
Created attachment 120450 [details]
a couple of locks using uvd.

Issues still exist in current powerplay.

Though testing uvd when there are issues may be a bit pointless -

It's possible to get it to produce corrupted output.

Eventually, using it may cause a lockup. Attached are two traces: the first using mplayer, the second gstreamer (vaapi).

I have locked up in the past using gst omx.
Comment 5 Andy Furniss 2015-12-12 23:54:15 UTC
Comment on attachment 120450 [details]
a couple of locks using uvd.

Maybe the locks are not a powerplay issue; I managed to lock gst vaapi with powerplay=0 and on drm-next-4.5. I will have to investigate more/open a new bug over time.
Comment 6 Andy Furniss 2015-12-19 12:16:01 UTC
Ignoring lockups for now, as maybe powerplay/4.5 needs something like -

fixes-4.4 commit drm/amdgpu: fix user fence handling

plus I see vaapi threading fixes waiting in mesa.

Testing the latest powerplay, the "UVD breaks things" issue is still present and there is nothing new in dmesg.

Looking at cat /sys/kernel/debug/dri/64/amdgpu_pm_info, what happens is -

Testing opengl: initially working, though gpu load sometimes looks a bit high when nothing is happening (clocks still low).

Testing just UVD (no vo) on auto, mclk and sclk rise (sclk not 100% though).

After this it seems mclk gets stuck low, so running opengl looks like -

 [  mclk  ]: 150 MHz

 [  sclk  ]: 972 MHz

 [GPU load]: 100%

Nothing is logged at this point.

If I now try to force high it fails and I see -

[ 7445.269812] Failed to send Message.
[ 7445.269815] force highest mclk dpm state failed!

Clocks now look like -

 [  mclk  ]: 0 MHz

 [  sclk  ]: 0 MHz

 [GPU load]: 0%

Using opengl, perf is still low and the clocks still say 0 with load 100%.

If I use uvd the clocks do go high -

 [  mclk  ]: 1375 MHz

 [  sclk  ]: 973 MHz

 [GPU load]: 23%

and they stay high and opengl perf is good again.

cat /sys/class/drm/card0/device/power_dpm_force_performance_level

still says auto.
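The clock/load readings quoted above can be pulled out of the debugfs file with a quick grep. A hedged helper: the dri/64 node number matches this machine, and PM_INFO is overridable so it can be run against a saved copy of the file:

```shell
# Print just the mclk/sclk/load lines from amdgpu_pm_info; the dri/64 node
# number is machine-specific, and PM_INFO is overridable for offline use.
PM_INFO="${PM_INFO:-/sys/kernel/debug/dri/64/amdgpu_pm_info}"
grep -E 'mclk|sclk|GPU load' "$PM_INFO"
```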
Comment 7 Andy Furniss 2015-12-19 12:41:46 UTC
Continued from above.

With clocks stuck high, if I echo auto > ...

the sclk does fall but mclk is stuck high.

If I then use UVD mclk will fall, but will be stuck there when using opengl.
Comment 8 Andy Furniss 2016-01-25 21:34:45 UTC
Issue(s) still exist in latest drm-next-4.6-wip.

With the new sysfs interface I can also see pcie, which seems OK. The message seen when trying to force a level, after uvd use has "fixed" mclk, has changed -

[  804.212722] Failed to send Message.
[  804.212726] Target_and_current_Profile_Index. Curr_Mclk_Index does not match the level
[  806.957942] Failed to send Previous Message.
Comment 9 Alex Deucher 2016-02-05 04:13:12 UTC
Created attachment 121533 [details] [review]
possible fix

Does this patch help?
Comment 10 Andy Furniss 2016-02-05 10:41:14 UTC
Will test the patch sometime later; as it happens, I built the latest 4.6-wip last night and managed to see memclk stuck without (knowingly) touching UVD.

I was running with auto and it's quite possible that leaving it alone and running another high-usage gl app would have worked, but to test I tried forcing and got -

Feb  5 00:38:44 ph4 kernel: Failed to send Message.
Feb  5 00:38:44 ph4 kernel: [ powerplay ] Target_and_current_Profile_Index. ^I^I^I^I^I^ICurr_Mclk_Index does not match the level 
Feb  5 00:38:47 ph4 kernel: Failed to send Previous Message.

In addition, after that I rebuilt latest llvm/mesa, which has resulted in dmesg full of vmfaults, so I now need to sort that first.
Comment 11 Andy Furniss 2016-02-05 14:07:26 UTC
(In reply to Alex Deucher from comment #9)
> Created attachment 121533 [details] [review] [review]
> possible fix
> 
> Does this patch help?

Yes, it seems the patch is good.

Early days, but I've so far failed to get any errors with mixed running of uvd/gl/forced levels.

It also seems to have fixed another UVD issue where repeated running, varying between vdpau, vaapi and omx, would eventually start producing corrupted output.

UVD also still has full perf which is good :-)

With or without this patch there is still an issue around the auto perf setting, in that for real-world video playback I would need to force the clocks high, as auto gpu load detection doesn't up the clocks enough for demanding tests. I am talking about something like 2160p60 content scaled down to 1080p. Testing with mpv, whether --vo=vdpau, vaapi, opengl or opengl-hq, all fail to register enough load to get the clocks up.

UVD alone (i.e. testing to ram) will up the clocks so auto perf is quite close to high, but combined with actually displaying it won't, because I guess it gets limited by the --vo, so it doesn't get to go fast enough to up the clocks itself.
Comment 12 Alex Deucher 2016-02-05 22:24:48 UTC
Can you try my latest 4.6 wip branch?  I fixed it in a more unified way.
http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.6-wip
Comment 13 Andy Furniss 2016-02-06 22:26:34 UTC
(In reply to Alex Deucher from comment #12)
> Can you try my latest 4.6 wip branch?  I fixed it in a more unified way.
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.6-wip

This also fixes the memclk getting stuck and uvd corruption issues.
Comment 14 Andy Furniss 2016-02-08 22:56:57 UTC
I've been testing vce and there is an issue with auto.

This is not just this kernel; I've gone back, and now that I have a script to test lots of runs, I can reproduce on older kernels + current fixes as well.

The issue is that it will hang; at that point I am apparently OK in that I can use the desktop normally. There is no hung task timeout. If I kill the gstreamer process it won't return, and then I may get a hung task trace.

Whether I kill gstreamer or not, quitting X or a VT switch will lock up display.

This only happens when /sys/class/drm/card0/device/power_dpm_force_performance_level is auto.

If it's high or low I can repeatedly run vce encodes OK - I have tested > 1000.

On this kernel and fixes it only takes < 5 runs to lock with the same test.

On an older kernel it lasted for 25 runs (which is, I guess, why I didn't hit it in "normal" testing + I often forced high for benchmarking anyway).
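A hedged sketch of the kind of repeat-run harness described above. ENCODE_CMD is a placeholder, since the actual gstreamer pipeline was not posted in the report:

```shell
# Repeat-run harness sketch; ENCODE_CMD is hypothetical (the real gstreamer
# pipeline from the report was not posted), defaulting to a no-op for dry runs.
ENCODE_CMD="${ENCODE_CMD:-true}"
RUNS="${RUNS:-1000}"
ok=0
for i in $(seq 1 "$RUNS"); do
    # A timeout stands in for detecting the hang, since the report notes that
    # no hung-task timeout fires on its own while the encode is stuck.
    if ! timeout 120 sh -c "$ENCODE_CMD"; then
        echo "run $i hung or failed" >&2
        break
    fi
    ok=$i
done
echo "completed $ok runs"
```

Something like `ENCODE_CMD='gst-launch-1.0 ...' RUNS=1000 sh harness.sh` would reproduce the > 1000-run test on high/low versus the < 5-run lockup on auto.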
Comment 15 Andy Furniss 2016-02-08 23:01:04 UTC
Created attachment 121605 [details]
vce hung task
Comment 16 Andy Furniss 2016-02-11 18:30:16 UTC
(In reply to Andy Furniss from comment #13)
> (In reply to Alex Deucher from comment #12)
> > Can you try my latest 4.6 wip branch?  I fixed it in a more unified way.
> > http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.6-wip
> 
> This also fixes the memclk getting stuck and uvd corruption issues.

It seems the changes got lost in the latest drm-next-4.6-wip update, so I am back to stuck mclk.
Comment 17 Alex Deucher 2016-02-11 18:36:06 UTC
(In reply to Andy Furniss from comment #16)
> (In reply to Andy Furniss from comment #13)
> > (In reply to Alex Deucher from comment #12)
> > > Can you try my latest 4.6 wip branch?  I fixed it in a more unified way.
> > > http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.6-wip
> > 
> > This also fixes the memclk getting stuck and uvd corruption issues.
> 
> It seems the changes got lost in the latest drm-next-4.6-wip update so I am
> back to stuck mclck.

The fixes went into my drm-fixes branch (drm-fixes-4.5) which will go upstream for 4.5.  My drm-next-4.6-wip branch is just new features for 4.6.
Comment 18 Andy Furniss 2016-02-11 20:07:31 UTC
(In reply to Alex Deucher from comment #17)
> (In reply to Andy Furniss from comment #16)
> > (In reply to Andy Furniss from comment #13)
> > > (In reply to Alex Deucher from comment #12)
> > > > Can you try my latest 4.6 wip branch?  I fixed it in a more unified way.
> > > > http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.6-wip
> > > 
> > > This also fixes the memclk getting stuck and uvd corruption issues.
> > 
> > It seems the changes got lost in the latest drm-next-4.6-wip update so I am
> > back to stuck mclck.
> 
> The fixes went into my drm-fixes branch (drm-fixes-4.5) which will go
> upstream for 4.5.  My drm-next-4.6-wip branch is just new features for 4.6.

Ahh, OK.
Comment 19 Andy Furniss 2016-05-17 10:08:12 UTC
Thought I'd already reported this one, but apparently not.

cat /sys/kernel/debug/dri/64/amdgpu_pm_info

Ever since this option appeared (so before powergating was disabled/clockgating enabled, and currently), VCE is always shown as enabled. UVD behaves and only says enabled when in use.
Comment 20 Andy Furniss 2016-05-17 10:41:31 UTC
Though UVD seems to work for normal samples up to 2160p now, with an extreme test I can lock the GPU using it with powerplay=1 on low/auto, but not (so far) with =0.

With this sample there is also a corruption issue even with powerplay=0.

The sample decodes perfectly every time at full speed (player) with powerplay=1 and clocks forced high.

Tested with mpv mainly, but kodi, mplayer, decoding to ram with ffmpeg, gst omx or gst vaapi all give similar results. Similar = decoding to ram may avoid the lock (which always happens with players) but will be corrupt on auto. It is still possible to lock while decoding to ram.

Historically with a "normal" 2160p60 I have seen this rarely, but something seems to have become more efficient, so 2160p60 that used to need clocks forced high will work on auto now.

The issue with this sample exists on older kernels as well as current.

The sample is rather large and 4080x4096; it is "free" for testing AIUI, as the source images are from vqeg.

I thought I would upload it as I notice Leo Liu works on UVD and has a Tonga.

It's 300 meg for 8.6 seconds! Made to level 5.2 cbr (but is > 5.2 due to frame size/num refs).

https://drive.google.com/file/d/0BxP5-S1t9VEEWGREeXlrQkZfaDQ/view?usp=sharing

Testing on a 1920x1080 screen with mpv -fs --hwdec=vdpau, it will hang quickly though the GPU is still OK at this point. I could even run a gl Unigine bench OK (though the mem clock is stuck low).

Trying to switch vt or quit X would hang the display.

pkill -9 mpv won't instantly lock the display (I use a non-compositing desktop), but touching gl, even glxinfo, will then lock the display.

Waiting 2 minutes before sysrq will give the attached hung task trace.
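For reference, a blocked-task dump like the attached trace is typically captured via sysrq. The exact key used in the report isn't stated; 'w' (dump blocked tasks) is the usual choice, and both paths here are made overridable so the snippet can be dry-run without root:

```shell
# Sketch of capturing a blocked-task dump via sysrq (normally needs root).
# 'w' dumps tasks in uninterruptible sleep to the kernel log; which key the
# reporter actually used is an assumption. Paths are overridable for dry runs.
SYSRQ_CTL="${SYSRQ_CTL:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
echo 1 > "$SYSRQ_CTL"        # make sure sysrq is enabled
echo w > "$SYSRQ_TRIGGER"    # dump blocked tasks to the kernel log
dmesg | tail -n 200          # collect the resulting trace
```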
Comment 21 Andy Furniss 2016-05-17 10:42:06 UTC
Created attachment 123820 [details]
uvd hung task
Comment 22 Andy Furniss 2016-07-16 14:42:06 UTC
(In reply to Andy Furniss from comment #14)
> Been testing vce and there is an issue with auto.
> 
> This is not just this kernel, I've been back and now I have a script to test
> lots of runs, I can reproduce on older kernels + current fixes as well.
> 
> The issue is that it will hang, at this time I am apparently OK in that I
> can use desktop normally. There is no hung task timeout. If I kill the
> gstreamer process it won't return, then I may get a hung task trace.
> 
> Whether I kill gstreamer or not, quitting X or a VT switch will lock up
> display.
> 
> This only happens when
> /sys/class/drm/card0/device/power_dpm_force_performance_level is auto.
> 
> If it's high or low I can repeatedly run vce encodes OK - I have tested >
> 1000.
> 
> On this kernel and fixes it only takes < 5 runs to lock with the same test.
> 
> On an older kernel it lasted for 25 runs (which is I guess why I didn't hit
> it in "normal" testing + I often forced high for bench marking anyway)

With the latest tonga vce firmware + current agd5f drm-next kernels I can't reproduce this anymore.
Comment 23 Andy Furniss 2016-11-07 10:17:15 UTC
While this bug is a bit messy, being a multi-issue bug, there are still problems = I can still lock the display as in

https://bugs.freedesktop.org/show_bug.cgi?id=92936#c20

I will close this one and open per-issue bug(s) when I get some time to re-test everything.
Comment 24 Andy Furniss 2018-11-30 17:27:02 UTC
Doesn't lock with a brief test, though it doesn't always decode properly with mpv vdpau.

Solid with vaapi, and seeing as vdpau is dead I care less about testing with it.
