Bug 58354 - [bisected] r600g: use DMA engine for VM page table updates on cayman locks in Unigine Tropics
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: Other All
Importance: medium critical
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 58667
Blocks:
Reported: 2012-12-16 06:26 UTC by Alexandre Demers
Modified: 2013-11-08 15:39 UTC (History)

Attachments
possible fix (3.69 KB, patch)
2012-12-23 02:27 UTC, Alex Deucher
dmesg after lockup (57.18 KB, text/x-log)
2012-12-23 03:45 UTC, Alexandre Demers
dmesg from killing Xorg remotely when frozen with patch 72013 applied (129.88 KB, text/plain)
2013-01-09 01:27 UTC, Alexandre Demers
possible fix (4.24 KB, patch)
2013-01-10 18:04 UTC, Alex Deucher
errors.log when tropics froze with patch 72794 (63.62 KB, text/plain)
2013-01-11 01:01 UTC, Alexandre Demers
patch 1/2 (1.39 KB, patch)
2013-01-31 20:22 UTC, Alex Deucher
patch 2/2 (14.82 KB, patch)
2013-01-31 20:23 UTC, Alex Deucher

Description Alexandre Demers 2012-12-16 06:26:59 UTC
Testing with drm-next with latest mesa, ddx and drm, Unigine Tropics locks up when launching the demo. The problem appears somewhere between a636a9829175987e74ddd28a2e87ed17ff7adfdc (locks) and 1a1494def7eacbd25db05185aa2e81ef90892460 (OK). I'll pinpoint it tomorrow.
Comment 1 Alexandre Demers 2012-12-16 17:40:18 UTC
33e5467871b3007c4e6deea95b2cac38a55ff9f5 is the first bad commit
commit 33e5467871b3007c4e6deea95b2cac38a55ff9f5
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Mon Oct 22 12:22:39 2012 -0400

    drm/radeon: use DMA engine for VM page table updates on cayman/TN
    
    DMA engine has special packets to facilitate this and it also keeps
    the 3D engine free for other things.
    
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
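
For readers unfamiliar with how a "first bad commit" is pinpointed, the search is a binary search over the commit range. The following is only a minimal sketch of that idea, not the tool actually used here. The commit names and the `is_bad` test are hypothetical placeholders, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and every commit after the first bad
    one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history: the lockup first appears in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad_commit(history, lambda c: history.index(c) >= 5)
```

Each step halves the remaining range, so even a range of several hundred commits needs only about ten build-and-test cycles.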
Comment 2 Alexandre Demers 2012-12-16 17:45:16 UTC
Obviously, this is on a 6950 Cayman.
Comment 3 Alex Deucher 2012-12-23 00:12:38 UTC
Can you get the dmesg output when the lockup happens?
Comment 4 Alexandre Demers 2012-12-23 00:58:07 UTC
I'm hitting another bug right now (bug 58655). I'll apply the patch proposed for that bug and see what I get then.
Comment 5 Alex Deucher 2012-12-23 02:27:01 UTC
Created attachment 72013
possible fix

Does this patch fix the issue?
Comment 6 Alexandre Demers 2012-12-23 03:45:18 UTC
Created attachment 72014
dmesg after lockup

This is the dmesg output salvaged over an ssh connection. Sadly, I don't think there is anything useful in there.
Comment 7 Alexandre Demers 2012-12-23 03:46:17 UTC
Would it help if I increased the debug level?
Comment 8 Alexandre Demers 2012-12-23 04:07:18 UTC
(In reply to comment #5)
> Created attachment 72013
> possible fix
> 
> Does this patch fix the issue?

Testing right away.
Comment 9 Alexandre Demers 2012-12-23 07:17:05 UTC
Doesn't fix it; it locks up as before. Sadly, dmesg seems to lose count because of another bug introduced in 3.8-rc1. Now that I have moved to 3.8-rc1, a huge number of messages appear in errors.log and in dmesg (when run in a terminal):
...
[ 6223.054880] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514
[ 6223.054882] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054883] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054885] radeon 0000:01:00.0: GPU fault detected: 146 0x00135514
[ 6223.054887] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054889] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054891] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514
[ 6223.054893] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054895] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054897] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054899] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054900] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054902] radeon 0000:01:00.0: GPU fault detected: 146 0x00136514
[ 6223.054904] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054906] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054908] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054910] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054912] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054914] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054916] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054918] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054920] radeon 0000:01:00.0: GPU fault detected: 146 0x00232514
[ 6223.054922] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054923] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054925] radeon 0000:01:00.0: GPU fault detected: 146 0x00232514
[ 6223.054927] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054930] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054932] radeon 0000:01:00.0: GPU fault detected: 146 0x0033d514
[ 6223.054934] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054936] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054938] radeon 0000:01:00.0: GPU fault detected: 146 0x0033d514
[ 6223.054940] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054942] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054944] radeon 0000:01:00.0: GPU fault detected: 146 0x00235514
[ 6223.054946] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054948] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054950] radeon 0000:01:00.0: GPU fault detected: 146 0x00235514
[ 6223.054952] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054954] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054956] radeon 0000:01:00.0: GPU fault detected: 146 0x0033e514
[ 6223.054958] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054960] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054962] radeon 0000:01:00.0: GPU fault detected: 146 0x0033e514
[ 6223.054963] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054965] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054967] radeon 0000:01:00.0: GPU fault detected: 146 0x00339514
[ 6223.054969] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054971] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054973] radeon 0000:01:00.0: GPU fault detected: 146 0x00339514
[ 6223.054975] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054977] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054979] radeon 0000:01:00.0: GPU fault detected: 146 0x00236514
[ 6223.054980] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6223.054982] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
...

I'm sure it is a different bug; can you confirm? dmesg.log stops being populated when X/GNOME starts, but running dmesg in a terminal outputs tons of the messages shown above. Should I open a new bug for it, or has it already been reported?
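
When a log is flooded like this, a per-value tally is easier to read than the raw output. The sketch below is an illustration only: it counts how often each hex value following "GPU fault detected: 146" occurs. The regular expression is derived purely from the lines quoted above, and the function name `count_faults` is hypothetical. It makes no claim about what the values mean.

```python
import re
from collections import Counter

# Pattern based only on the log lines quoted in this comment.
FAULT_RE = re.compile(r"GPU fault detected: (\d+) (0x[0-9a-fA-F]+)")


def count_faults(lines):
    """Count occurrences of each hex value printed on 'GPU fault detected' lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "[ 6223.054880] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514",
    "[ 6223.054882] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000",
    "[ 6223.054883] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000",
    "[ 6223.054885] radeon 0000:01:00.0: GPU fault detected: 146 0x00135514",
    "[ 6223.054891] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514",
]
faults = count_faults(sample)
```

In practice the input could be the saved output of `dmesg`, read line by line from a file.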
Comment 10 Alexandre Demers 2012-12-23 07:23:43 UTC
(In reply to comment #9)

Found something similar; it looks like bug 58667.
Comment 11 Alexandre Demers 2013-01-03 22:05:19 UTC
I know there is a CS error: a message appears in the terminal just when the lockup happens, and that is the point I should look for in dmesg. This is the only thing I can confirm for now, because bug 58667 floods my logs.
Comment 12 Alexandre Demers 2013-01-08 05:58:48 UTC
Just to let you know, commit http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935 didn't solve the issue for this bug.
Comment 13 Alex Deucher 2013-01-08 13:50:28 UTC
Is there anything in the kernel log when this happens now that the mesa fix is applied? Also, does the patch in attachment 72013 help now that the mesa side is fixed?
Comment 14 Alexandre Demers 2013-01-09 01:24:33 UTC
(In reply to comment #13)
> Is there anything in the kernel log when this happens now that the mesa fix
> is applied?  Also does the patch in attachment 72013 help
> now that the mesa side is fixed?

With or without the patch, it still ends with a message saying the kernel rejected the CS and to check dmesg. Then it freezes. However, when I access the machine through ssh, there is nothing useful I can get from it.

I killed Xorg remotely; the screen blinked for a moment and then displayed only garbage. I was able to retrieve something from dmesg. I killed it a second time, only to get different garbage.

I'll attach the file right away.
Comment 15 Alexandre Demers 2013-01-09 01:27:46 UTC
Created attachment 72694
dmesg from killing Xorg remotely when frozen with patch 72013 applied

This is with patch 72013 applied.
Comment 16 Alex Deucher 2013-01-09 13:58:29 UTC
Does a 3.8 kernel work OK if you revert mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a?

I think
r600g: rework flusing and synchronization pattern v7
http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
may be problematic on cayman.
Comment 17 Alexandre Demers 2013-01-10 03:03:57 UTC
(In reply to comment #16)
> Does a 3.8 kernel work OK if you revert mesa back to
> cf5632094ba0c19d570ea47025cf6da75ef8457a?
> 
> I think
> r600g: rework flusing and synchronization pattern v7
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
> may be problematic on cayman.

If it is, it is not the cause of this bug. I went back to cf563, applied a fix for glcpp, and reloaded the libraries; it still locks up at the same point, even after rebooting.
Comment 18 Alex Deucher 2013-01-10 18:04:37 UTC
Created attachment 72794
possible fix

Does this kernel patch help?
Comment 19 Alexandre Demers 2013-01-11 00:58:34 UTC
(In reply to comment #18)
> Created attachment 72794
> possible fix
> 
> Does this kernel patch help?

No. I was able to catch something in errors.log and kernel.log though. I'm attaching the truncated file in a few seconds. I hit a GPU fault.

I'll do the same test without the patch to know if it is related or not.
Comment 20 Alexandre Demers 2013-01-11 01:01:57 UTC
Created attachment 72822
errors.log when tropics froze with patch 72794

The log was originally about 53 MB, since it kept logging messages until I hit the reset button. It was the same messages over and over, so I truncated it.

The same messages were recorded in everything.log and kernel.log, with no error messages before them.
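
Rather than truncating by hand, a log that repeats the same messages can be reduced to one line per distinct message with a repeat count. This is a minimal sketch under the assumption that lines start with a bracketed kernel timestamp such as `[ 6223.054880]`. The helper name `summarize` is hypothetical.

```python
import re

# Matches a leading kernel timestamp such as "[ 6223.054880] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize(lines):
    """Return (message, count) pairs in first-seen order, ignoring timestamps."""
    counts = {}
    for line in lines:
        message = TIMESTAMP_RE.sub("", line.rstrip("\n"))
        counts[message] = counts.get(message, 0) + 1
    return list(counts.items())


log = [
    "[ 6223.054880] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514\n",
    "[ 6223.054882] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000\n",
    "[ 6223.054891] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514\n",
    "[ 6223.054893] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000\n",
]
summary = summarize(log)
```

Because timestamps are stripped before counting, lines that differ only in their timestamp collapse into a single entry.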
Comment 21 Alexandre Demers 2013-01-17 04:07:02 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Created attachment 72794
> > possible fix
> > 
> > Does this kernel patch help?
> 
> No. I was able to catch something in errors.log and kernel.log though. I'm
> attaching the truncated file in a few seconds. I hit a GPU fault.
> 
> I'll do the same test without the patch to know if it is related or not.

Just to let you know, it does the same thing whether I apply the patch or not, even with today's latest kernel git. I just have to get ready to launch Tropics, connect through ssh from my tablet, launch Tropics and, when it freezes, run dmesg from the tablet. After that, the GPU faults are logged in my various log files.
Comment 22 Alex Deucher 2013-01-31 20:22:37 UTC
Created attachment 74014
patch 1/2

Does this set of patches fix the issue?  I think we are running out of ring space for large VM page table updates since the DMA ring is smaller than the CP ring.
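
To illustrate the hypothesis above: if an update is written directly into a ring, it cannot need more space than the ring has free. The sketch below splits a large update into chunks that each fit within a fixed budget. It is a toy model only. The function name `split_update` and all sizes (`header_dwords`, `dwords_per_pte`, the ring capacity) are hypothetical and are not taken from the radeon driver or from the attached patches, which may solve the problem in a different way.

```python
def split_update(num_ptes, ring_free_dwords, header_dwords=4, dwords_per_pte=2):
    """Split an update of num_ptes entries into chunk sizes that fit the ring.

    Each chunk is assumed to cost header_dwords plus dwords_per_pte per
    entry; these costs are illustrative, not real packet sizes.
    """
    max_ptes = (ring_free_dwords - header_dwords) // dwords_per_pte
    if max_ptes < 1:
        raise ValueError("ring too small for even one entry")
    chunks = []
    remaining = num_ptes
    while remaining > 0:
        n = min(remaining, max_ptes)  # never exceed what the ring can hold
        chunks.append(n)
        remaining -= n
    return chunks


# Hypothetical numbers: 1000 entries, 260 free dwords in the ring.
chunks = split_update(1000, 260)
```

With these made-up numbers, a single packet would need 2004 dwords, far more than the 260 available, which is the kind of overflow the comment describes for a ring that is smaller than the CP ring.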
Comment 23 Alex Deucher 2013-01-31 20:23:10 UTC
Created attachment 74015
patch 2/2
Comment 24 Alexandre Demers 2013-02-01 03:58:27 UTC
It fixes the problem! Good work! I let it run for some time and it ran without any lockups.
Comment 25 Alex Deucher 2013-02-01 13:48:29 UTC
I've switched back to the CP for 3.8, and 3.9 will contain the new patch.
Comment 26 Florian Mickler 2013-02-23 10:28:30 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc7:

commit 3646e4209f2bd0d09022ed792e594fb4f559b86c
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Thu Jan 31 16:19:19 2013 -0500

    drm/radeon: switch back to the CP ring for VM PT updates

