Bug 97025

Summary: flip queue failed: Device or resource busy
Product: DRI
Reporter: Bernd Steinhauser <linux>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: 2bluesc, leio
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                               | Flags
Xorg.0.log                                | none
dmesg output                              | none
dmesg output after the freeze             | none
irc conversiaton with Martin Grässlin     | none
plasmashell backtrace                     | none
dmesg after reenabled DP                  | none
Xorg.0.log after reenabled DP             | none
Delayed recovery from display sleep logs  | none

Description Bernd Steinhauser 2016-07-21 15:15:26 UTC
Since last week I have random freezes with the amdgpu driver (running on Kaveri).

Once the issue occurs the display freezes. Switching to VT2 and back does not fix it.

In Xorg.0.log I can find multiple times:
[ 92357.021] (WW) AMDGPU(0): flip queue failed: Device or resource busy
[ 92357.021] (WW) AMDGPU(0): Page flip failed: Device or resource busy
[ 92357.021] (EE) AMDGPU(0): present flip failed

No related messages in the journal or dmesg afaics.
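
To see how often each of these warnings shows up in a long Xorg.0.log, a small tally script can help. This is only a sketch: the regular expression is an assumption based on the three lines quoted above, and `count_amdgpu_messages` is a hypothetical helper name, not part of any driver tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above, e.g.:
# [ 92357.021] (WW) AMDGPU(0): flip queue failed: Device or resource busy
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+\((WW|EE)\)\s+AMDGPU\(\d+\):\s+(.*\S)\s*$"
)


def count_amdgpu_messages(log_text):
    """Count AMDGPU warning/error lines, keyed by (level, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            _timestamp, level, message = match.groups()
            counts[(level, message)] += 1
    return counts


sample = """\
[ 92357.021] (WW) AMDGPU(0): flip queue failed: Device or resource busy
[ 92357.021] (WW) AMDGPU(0): Page flip failed: Device or resource busy
[ 92357.021] (EE) AMDGPU(0): present flip failed
"""

for (level, message), n in count_amdgpu_messages(sample).items():
    print(level, n, message)
```

For a real log, pass the contents of the Xorg.0.log file instead of `sample`.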

It does not seem to be related to a specific event (like a video playing), but just happens out of nowhere.
I didn't find a way to reproduce it specifically.

Possibly related packages that I built in that time:
* dev-lang/llvm-scm::arbor 2016-06-11 07:42:19 UTC
* dev-lang/llvm-scm::arbor 2016-06-19 07:29:42 UTC
* x11-dri/mesa-12.0.0-rc4::x11 2016-06-21 21:40:48 UTC
* dev-lang/llvm-scm::arbor 2016-07-02 11:57:34 UTC
* dev-lang/clang-scm::arbor 2016-07-02 12:43:00 UTC
* dev-lang/llvm-3.8.0-r1::arbor 2016-07-12 20:04:14 UTC
* dev-lang/clang-3.8.0::arbor 2016-07-12 20:48:27 UTC
* x11-dri/mesa-12.0.0::x11 2016-07-13 04:42:47 UTC
* x11-dri/mesa-12.0.1::x11 2016-07-17 14:25:44 UTC
* x11-server/xorg-server-1.18.4::x11 2016-07-20 16:06:18 UTC

I couldn't get mesa 12 to build with llvm-scm anymore, so I downgraded.
Still, I doubt it's related.

It's hard to be certain about this, but it could have been a regression introduced with mesa 12, possibly mesa-12.0.0.
I'm pretty sure I haven't seen the freeze before 12.0.0 final, but it's hard to be certain about this with an issue so random.

In case it matters, my xorg settings are: 
Section "Device"
    Identifier  "AMDGPU"
    Driver      "amdgpu"
    Option      "TearFree" "Off"
    Option      "EnablePageFlip" "On"
    Option      "DRI" "3"
EndSection

IIRC, this is now standard, so nothing special here.
Comment 1 Michel Dänzer 2016-07-22 01:49:10 UTC
Please attach the Xorg log and dmesg output corresponding to the problem.

(In reply to Bernd Steinhauser from comment #0)
> * x11-server/xorg-server-1.18.4::x11 2016-07-20 16:06:18 UTC

Which version of xorg-server were you using before? Does going back to that fix the problem?
Comment 2 Bernd Steinhauser 2016-07-22 04:45:25 UTC
(In reply to Michel Dänzer from comment #1)
> Please attach the Xorg log and dmesg output corresponding to the problem.
> 
> (In reply to Bernd Steinhauser from comment #0)
> > * x11-server/xorg-server-1.18.4::x11 2016-07-20 16:06:18 UTC
> 
> Which version of xorg-server were you using before? Does going back to that
> fix the problem?

Before it was 1.18.3 installed in April. I hoped that the update might improve the situation, but it didn't.
So I'm pretty sure that the xorg-server update is unrelated.
Comment 3 Bernd Steinhauser 2016-07-24 15:28:05 UTC
I noticed that I updated my kernel from 4.6.3 to 4.6.4 on the 12th of July, so I thought it could be related and investigated a little.
Then I stumbled across this log, which I think was the first time this happened. This is from journald:
Jul 09 08:59:43 orionis kernel: Linux version 4.6.3-amdgpu (root@orionis) (gcc version 5.3.0 (GCC) ) #1 SMP PREEMPT Sat Jun 25 21:20:12 CEST 2016
[...]
Jul 09 17:04:08 orionis kernel: [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip
Jul 09 17:04:09 orionis kernel: [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip

No idea why in this case I can find some messages in the journal and in the other cases not.
Anyway, this means that the origin is not the update mesa-12.0.0-rc4 -> final and also not linux 4.6.3 -> 4.6.4.

Also unlikely 4.6.2 -> 4.6.3, since (as you can see above) this was built approx. 2 weeks before and within that amount of time I would surely have experienced the problem.
(Had it approx. 8 to 10 times during the last 2 weeks.)

Another message I found in a different log is:
Jul 24 00:45:17 orionis kernel: [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
Jul 24 00:45:17 orionis kernel: [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed

Not sure if it is related.

With regards to what package started to bring this up, I'm now almost out of ideas.
The only thing left would be kwin/plasma 5. The Update from Plasma 5.6.95 to 5.7.0 was performed on the 5th of July.
So, since kwin is what I use as a compositor (and Plasma 5 as a desktop), might it be that this triggers a bug?
Comment 4 Michel Dänzer 2016-07-26 03:42:15 UTC
Still looking for the full Xorg log and dmesg output, preferably captured after the problem occurred.

Does restarting kwin recover from the hang?
Comment 5 Bernd Steinhauser 2016-07-26 04:01:23 UTC
Sorry, missed that request in your post above.

dmesg output I don't have available as I didn't have ssh activated when the problem occurred.
(now I do)

I could attach the journald kernel output if that would be sufficient?
Comment 6 Bernd Steinhauser 2016-07-26 04:02:13 UTC
Created attachment 125329 [details]
Xorg.0.log
Comment 7 Bernd Steinhauser 2016-07-26 04:15:20 UTC
Created attachment 125333 [details]
dmesg output

dmesg output from the currently running system.

Attaching this as I noticed that I do get those vblank/flip messages even now, when I didn't experience the bug (yet).
Comment 8 Bernd Steinhauser 2016-07-26 18:06:22 UTC
I noticed that those two lines coincide with a certain event I can trigger:
Switching the DP-0 display off (an Eizo EV2455).
This leads to a disconnect of the DP connection and that leads (somehow) to the quoted messages about the failed vblank.
(I'm not sure if the disconnect is actually a bug in the kernel (as it's a DP1.2 display) or if my hardware/mainboard/GPU is just too old.)

However, this disconnect does not lead straight to the freeze.
And so far I haven't seen the bug directly after a DP disconnect, but just at some random point.
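
One way to confirm when the kernel actually sees the DP disconnect is to watch the connector state the DRM subsystem exposes under /sys/class/drm. The following is a minimal sketch under that assumption; `read_connector_status`, `diff_status` and `watch` are hypothetical helper names, and the connector name on a given machine may differ from the DP-0 that X reports.

```python
import time
from pathlib import Path


def read_connector_status(drm_root="/sys/class/drm"):
    """Return {connector_name: status} for each connector under drm_root."""
    statuses = {}
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        statuses[status_file.parent.name] = status_file.read_text().strip()
    return statuses


def diff_status(old, new):
    """List (connector, old_status, new_status) for every connector that changed."""
    return [
        (name, old.get(name), status)
        for name, status in new.items()
        if old.get(name) != status
    ]


def watch(drm_root="/sys/class/drm", interval=1.0, rounds=None):
    """Poll connector states and print every change; rounds=None polls forever."""
    previous = read_connector_status(drm_root)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        current = read_connector_status(drm_root)
        for name, before, after in diff_status(previous, current):
            print(f"{time.strftime('%H:%M:%S')} {name}: {before} -> {after}")
        previous = current
        done += 1
```

Running `watch()` in a terminal (or over ssh) while switching the monitor off and on would show whether the connector flips to `disconnected` at the moment the vblank messages appear.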
Comment 9 Bernd Steinhauser 2016-07-30 08:05:47 UTC
Created attachment 125433 [details]
dmesg output after the freeze

I logged into the machine during a freeze and saved the dmesg output.
Unfortunately, it doesn't seem to contain additional information.
Comment 10 Bernd Steinhauser 2016-08-09 18:36:41 UTC
Since the weekend, I ran kwin without compositing.

Since then, I haven't seen this happening, so I think this is a bug that is triggered by kwin when compositing, likely since 5.7.0.
Comment 11 Michel Dänzer 2016-08-10 08:30:02 UTC
Does explicitly disabling the DP output in the KDE configuration before turning off the monitor avoid the problem?
Comment 12 Bernd Steinhauser 2016-08-12 17:27:54 UTC
It does prevent the vblank messages in dmesg, I don't know if it'll prevent the freeze.
Comment 13 Bernd Steinhauser 2016-08-18 16:03:33 UTC
One more remark: I've only observed the effect when the OpenGL 3.1 compositing backend in kwin is active.
I tested with OpenGL 2 backend over the last week and have not seen this happening since.

I should also mention that I've had the egl interface activated, which is not recommended for kwin.
I've not had issues with it before, but it could be related, so the next thing I'm testing is glx/OpenGL 3.1 and hope I can narrow this down this way.
Comment 14 Bernd Steinhauser 2016-08-21 15:37:01 UTC
Ok, it's not egl, the same happens with glx/OpenGL3.
Comment 15 Bernd Steinhauser 2016-08-31 14:44:13 UTC
I tried a few things, but wasn't really able to nail this down.
I downgraded to mesa 11.2 to see if that helps, but it does not.

However, today I had plasmashell freezing after unlocking the screen.
Only plasmashell froze, everything else kept working as expected.

I contacted Martin on IRC and he thought it might be related to this.
I'll attach the log from the conversation as well as the backtrace.

He might be right, because around the time when this happened, I get these messages in dmesg:
[88765.431890] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.436865] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.441940] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.446861] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.451865] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.456903] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[89579.510005] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[89579.514998] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[89579.520053] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[89579.525158] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[113833.139104] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[113833.139117] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed
[113833.361471] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[113833.361484] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed
[113836.962993] [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip
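
Since these messages arrive in tight bursts separated by long quiet periods, grouping them by timestamp makes it easier to tell which burst belongs to which event. A rough sketch follows; the regular expression is an assumption derived from the lines above, the 60-second gap is an arbitrary choice, and `parse_drm_errors` and `group_bursts` are hypothetical helper names.

```python
import re

# Assumed line shape, e.g.:
# [88765.431890] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
DMESG_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+\[drm:(\w+)(?:\s+\[\w+\])?\]\s+\*ERROR\*\s+(.*\S)\s*$"
)


def parse_drm_errors(dmesg_text):
    """Return a list of (timestamp_seconds, function, message) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        match = DMESG_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2), match.group(3)))
    return events


def group_bursts(events, max_gap=60.0):
    """Group events whose timestamps lie within max_gap seconds of the previous one."""
    bursts = []
    for event in sorted(events):
        if bursts and event[0] - bursts[-1][-1][0] <= max_gap:
            bursts[-1].append(event)
        else:
            bursts.append([event])
    return bursts


sample = """\
[88765.431890] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[88765.456903] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[89579.510005] [drm:amdgpu_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
[113833.139104] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[113833.139117] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed
[113836.962993] [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip
"""

for burst in group_bursts(parse_drm_errors(sample)):
    functions = sorted({function for _, function, _ in burst})
    print(f"{burst[0][0]:.3f} .. {burst[-1][0]:.3f}: {len(burst)} errors from {functions}")
```

On this excerpt the grouping yields three bursts: two made up only of page flip errors, and a later one that starts with the DP link training errors.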
Comment 16 Bernd Steinhauser 2016-08-31 14:44:43 UTC
Created attachment 126131 [details]
irc conversiaton with Martin Grässlin
Comment 17 Bernd Steinhauser 2016-08-31 14:45:17 UTC
Created attachment 126132 [details]
plasmashell backtrace
Comment 18 Michel Dänzer 2016-09-01 03:44:21 UTC
(In reply to Bernd Steinhauser from comment #15)
> However, today I had plasmashell freezing after unlocking the screen.
> Only plasmashell froze, everything else kept working as expected.
[...]
> [...] I get these messages in dmesg:

There are some messages with a timestamp around 89xxx and some with a timestamp around 11383x. Almost 7 hours passed in between, so which group of messages corresponds to the plasmashell freeze? Probably the latter? Those look again like the DP connection is lost. Were you able to determine if explicitly disabling the DP output in the kwin settings avoids the freezes?
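
The "almost 7 hours" figure can be checked directly from the quoted timestamps, assuming (as is usual for dmesg) that they count seconds since boot. A quick sketch of the arithmetic:

```python
# Timestamps copied from the dmesg excerpt in comment #15 (seconds since boot, assumed).
last_rbo_error = 89579.525158    # last "failed to reserve new rbo buffer before flip"
first_dp_error = 113833.139104   # first "displayport link status failed"

gap_seconds = first_dp_error - last_rbo_error
gap_hours = gap_seconds / 3600

print(f"{gap_seconds:.0f} s = {gap_hours:.2f} h")  # prints: 24254 s = 6.74 h
```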
Comment 19 Bernd Steinhauser 2016-09-01 05:08:24 UTC
Yes, the ones around 11383x.

I can't yet be sure about DP, but I'll check again.
The problem is that I can't find a way to trigger it, it just happens randomly.

The DisplayPort Monitor is my main screen, it would mean I have to work for 1 week or so without it.
Comment 20 Bernd Steinhauser 2016-09-06 04:50:35 UTC
Ok, running for approx. 4 days now with DP-0 deactivated and so far didn't spot any problems.
Only at the very start, I could find these messages, but that was before running kde:
[   14.404932] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[   14.404939] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed

Still, it's hard to tell for this kind of problem that occurs so randomly.

I'll check whether I have another DP cable, so I can test that.
Comment 21 Bernd Steinhauser 2016-09-07 21:34:16 UTC
Ok, so I replaced the DP cable and reenabled the screen. Immediately after that I got these messages in dmesg. Note the time.
[338324.267684] [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip
[338324.489710] [drm:amdgpu_crtc_page_flip] *ERROR* failed to get vblank before flip
[338526.834794] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[338526.834801] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed
[338526.838652] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[338526.838655] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed

After that, no messages (related to the graphics stack) appeared in dmesg so far.
However, the X server log is now spammed with messages every few seconds:
[338324.859] (WW) AMDGPU(0): flip queue failed: Invalid argument 
[338324.859] (WW) AMDGPU(0): Page flip failed: Invalid argument
[338324.859] (EE) AMDGPU(0): present flip failed
[338324.940] (WW) AMDGPU(0): get vblank counter failed: Invalid argument
[338324.942] (WW) AMDGPU(0): get vblank counter failed: Invalid argument 
[338324.942] (WW) AMDGPU(0): flip queue failed: Device or resource busy 
[338324.942] (WW) AMDGPU(0): Page flip failed: Device or resource busy

This started right after activating the DP screen.
I guess sooner or later that will result in the freeze that I'm seeing.
(I'll upload both dmesg and Xorg.0.log.)

So yeah, it seems like this is a problem with the DP.
Since I don't think that I have two broken DP cables, I guess the problem is somewhere else.
If that would help, I can connect one of the other screens via DP and see if that makes a difference.
Comment 22 Bernd Steinhauser 2016-09-07 21:34:51 UTC
Created attachment 126285 [details]
dmesg after reenabled DP
Comment 23 Bernd Steinhauser 2016-09-07 21:35:28 UTC
Created attachment 126286 [details]
Xorg.0.log after reenabled DP
Comment 24 Kevin McCormack 2017-01-12 18:46:16 UTC
I am experiencing what I think may be a similar issue. When my display sleeps, it often does not wake up on keypress. I have to wait anywhere from a few seconds to a few minutes, and then I find errors like the following in my log:

[drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed
[drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed

I am running Antergos 64-bit with GNOME 3.22.2 on Wayland
Kernels 4.8.13 and 4.10.0-rc3-ga121103c9228
AMD FX-8370 
Sapphire Fury X
Comment 25 Kevin McCormack 2017-01-12 18:47:26 UTC
Created attachment 128916 [details]
Delayed recovery from display sleep logs
Comment 27 Bernd Steinhauser 2018-05-12 08:27:23 UTC
Thanks, I'm testing it right now on linux 4.16.8.

Although I'm not sure if it works as expected, since the display does still seem to disconnect when I turn the screen off.

At least the messages in dmesg are gone, so it's definitely different compared to previous tests.
Can't say anything about the freezes without extensive testing, though.
Comment 28 Michel Dänzer 2018-05-14 10:01:40 UTC
(In reply to Bernd Steinhauser from comment #27)
> 
> Although I'm not sure if it works as expected, since the display does still
> seem to disconnect when I turn the screen off.

AFAIK that's either a monitor or general DisplayPort issue. The drivers can't prevent it but have to cope with it.
Comment 29 Bernd Steinhauser 2018-05-20 15:09:38 UTC
(In reply to Michel Dänzer from comment #28)
> (In reply to Bernd Steinhauser from comment #27)
> > 
> > Although I'm not sure if it works as expected, since the display does still
> > seem to disconnect when I turn the screen off.
> 
> AFAIK that's either a monitor or general DisplayPort issue. The drivers
> can't prevent it but have to cope with it.

Quite possible. I've seen such behaviour on Windows as well on some displays.
I don't really get it; it's very annoying if your windows are rearranged just because you turned off a display to save some power.

Anyway back to topic:
[595475.710884] [drm:amdgpu_atombios_dp_link_train] *ERROR* displayport link status failed
[595475.710902] [drm:amdgpu_atombios_dp_link_train] *ERROR* clock recovery failed

I do still get those messages sometimes, but at least I didn't experience any lockups or freezes.
Comment 30 Fermulator 2018-08-13 15:31:47 UTC
note, experiencing the same (or at least similar) issues -- my story is bug'd here:
 * https://bugs.freedesktop.org/show_bug.cgi?id=107560
Comment 31 Martin Peres 2019-11-19 08:08:49 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/80.
