Bug 91141 - Lots of *ERROR* Couldn't update BO_VA (-22) since drm/radeon: stop using addr to check for BO move
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2015-06-29 10:38 UTC by hadack
Modified: 2016-06-15 11:48 UTC

Attachments
dmesg (63.74 KB, text/plain)
2015-06-29 10:38 UTC, hadack
errors (207.79 KB, text/plain)
2015-06-29 10:39 UTC, hadack
Debuging patch. (419 bytes, patch)
2015-07-03 13:50 UTC, Christian König
output with debugging patch (135.97 KB, text/plain)
2015-07-03 14:53 UTC, hadack
Possible fix (2.14 KB, patch)
2015-07-03 18:49 UTC, Christian König
dmesg with possible fix applied (221.45 KB, text/plain)
2015-07-03 22:04 UTC, hadack
Possible fix part 2 (1.89 KB, text/plain)
2015-07-06 12:31 UTC, Christian König

Description hadack 2015-06-29 10:38:40 UTC
Created attachment 116789 [details]
dmesg

This is with the latest kernel from Linus' git tree on a CAPE VERDE card.
When the errors appear, I get screen corruption when scrolling in a browser/file-manager and missing/changed letters in a terminal.

A bisect led to commit 161ab658a611df14fb0365b7b70a8c5fed3e4870, and reverting it on master makes everything work normally again.
Comment 1 hadack 2015-06-29 10:39:23 UTC
Created attachment 116790 [details]
errors
Comment 2 Christian König 2015-06-29 13:05:13 UTC
Fix is already in Alex's drm-fixes-4.2 tree and should appear in -rc1.

If for some reason you need it sooner, just cherry-pick "drm/radeon: fix adding all VAs to the freed list on remove v2".
Comment 3 hadack 2015-06-30 11:11:12 UTC
Found the fix in his amdgpu branch and it fixes it, thanks!
Comment 4 Christian König 2015-06-30 13:19:42 UTC
(In reply to hadack from comment #3)
> Found the fix in his amdgpu branch and it fixes it, thanks!

Some users still report some issues even after this fix, so please keep an eye open for additional issues.

If you find some then please reopen this bug report.
Comment 5 hadack 2015-06-30 16:17:59 UTC
Hmm, seems you are right, desktop usage is fine on xfce with compton but starting a game like KSP leads to a non-refreshing screen. Reverting both commits makes it work again.
Comment 6 Dave Witbrodt 2015-07-02 16:54:21 UTC
(In reply to hadack from comment #5)
> Hmm, seems you are right, desktop usage is fine on xfce with compton but
> starting a game like KSP leads to a non-refreshing screen. Reverting both
> commits makes it work again.

I can verify the same observations on my HD 7850 (PITCAIRN 0x1002:0x6819 0x1787:0x2320) card.  I use Linux stable kernels with Radeon DRM (and core DRM) cherry-picked in from drm-next and drm-fixes.  With my last local update -- from kernel 4.0.4 + DRM 4.1 cherry-picks, to 4.0.6 + DRM 4.1 + DRM 4.2 -- running 'alien-arena' as a test program causes the DE (also Xfce, as is the case with hadack) to stop responding once I exit the game; also, the DE itself seems to trigger the bug after a while, or when resuming from suspend.

I tried the patch mentioned in comment 2 ("drm/radeon: fix adding all VAs to the freed list on remove v2"), but the symptoms described above continued.

Reverting 161ab658, and not applying the "fix ... VAs ... v2" patch, gives me a working kernel.  (And one I am very happy with!  My current combination of LLVM 3.7, Mesa, libdrm, xf86-video-ati, and xorg-server is the fastest, most responsive system I've ever had with open source drivers.)
Comment 7 Shawn Starr 2015-07-03 05:27:59 UTC
I confirm that after removing both patches mentioned (from dri-next-4.2), the issue no longer happens for me.
Comment 8 Christian König 2015-07-03 13:50:26 UTC
Created attachment 116918 [details] [review]
Debuging patch.

I unfortunately can't reproduce the issue.

So could somebody please apply the attached patch and try to get me the resulting stack dump? I need to know who is calling this function.

Thanks in advance,
Christian.
Comment 9 hadack 2015-07-03 14:53:41 UTC
Created attachment 116924 [details]
output with debugging patch

Here is the output with the debugging patch applied.
Comment 10 Christian König 2015-07-03 18:49:44 UTC
Created attachment 116933 [details] [review]
Possible fix

Thanks. Does the attached patch fix the issue?
Comment 11 hadack 2015-07-03 22:04:51 UTC
Created attachment 116936 [details]
dmesg with possible fix applied

Still not working with the possible fix applied.
Comment 12 Christian König 2015-07-06 12:31:51 UTC
Created attachment 116973 [details]
Possible fix part 2

Please apply this one on top of the first fix and see if the problem still happens.

Sorry that I can't find it offhand and need to check each possible cause separately, but as noted before I can't reproduce the issue here.
Comment 13 hadack 2015-07-06 17:53:39 UTC
No problem, seems the second try was it. With both patches applied it works fine. Tested standard desktop usage and KSP.
Comment 14 Christian König 2015-07-07 10:04:44 UTC
(In reply to hadack from comment #13)
> No problem, seems the second try was it. With both patches applied it works
> fine. Tested standard desktop usage and KSP.

Thanks for testing. Are you convinced enough that it works so that I can add a "Tested-by: hadack@gmx.de" to the patches while pushing them towards 4.2?
Comment 15 Dave Witbrodt 2015-07-07 13:43:49 UTC
(In reply to Christian König from comment #12)
> Created attachment 116973 [details]
> Possible fix part 2
> 
> Please apply this one on top of the first fix and see if the problem still
> happens.

Works well on my machine.  The programs that triggered the bug before no longer cause any problems.

Sanity check:  I had dropped 161ab658 and b13e22ae from my list of cherry picks before in order to have a working kernel.  After adding those back, and applying

    0001-drm-radeon-allways-add-the-VM-clear-duplicate.patch
    0001-drm-radeon-check-if-BO_VA-is-set-before-adding-it-to.patch

everything works great again.  I have not yet tested suspend-to-RAM, but after the testing I've done so far I doubt there will be problems.

HTH,
DW
Comment 16 hadack 2015-07-07 14:23:20 UTC
Still working fine here. I tested all the ways to trigger it and it's fine. Feel free to add the Tested-by.
Comment 17 Daniel Exner 2015-07-07 18:25:46 UTC
I can also confirm that a suspend/resume cycle no longer floods my kernel log with Linus' kernel + the two patches.
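
For anyone who wants to check their own log for the flood, here is a rough sketch that counts the error lines in saved dmesg output. The pattern is taken only from the bug title ("*ERROR* Couldn't update BO_VA (-22)"); the function name count_bo_va_errors is made up for illustration, and the real log prefix around the message may differ.

```python
import re
import sys
from collections import Counter

# Pattern assumed from the bug title only; whatever prefix dmesg puts in
# front of the message is ignored by using search() instead of match().
BO_VA_ERROR = re.compile(r"\*ERROR\* Couldn't update BO_VA \((-?\d+)\)")


def count_bo_va_errors(lines):
    """Return a Counter mapping error code -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        found = BO_VA_ERROR.search(line)
        if found:
            counts[int(found.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from stdin and prints one line per error code seen.
    for code, hits in sorted(count_bo_va_errors(sys.stdin).items()):
        print(f"BO_VA update failed with {code}: {hits} line(s)")
```

Saved as a script (the file name is up to you), it can be fed a dmesg dump on stdin. An empty result after a suspend/resume cycle would match what is reported above.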
Comment 18 Dave Witbrodt 2015-07-08 03:26:48 UTC
(In reply to Dave Witbrodt from comment #15)
[...]
> I have not yet tested suspend-to-RAM, but
> after the testing I've done so far I doubt there will be problems.

I tried suspend-to-RAM before leaving for work, and it resumed fine after work.  No problems at all with the code in question.

DW
Comment 19 Christian König 2016-06-15 11:48:28 UTC
I think we can close this one now.

