Summary: | [bisected] r600g: use DMA engine for VM page table updates on cayman locks in Unigine Tropics | ||
---|---|---|---|
Product: | Mesa | Reporter: | Alexandre Demers <alexandre.f.demers> |
Component: | Drivers/Gallium/r600 | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | critical | ||
Priority: | medium | CC: | florian |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | 58667 | ||
Bug Blocks: | |||
Attachments: |
possible fix
dmesg after lockup
dmesg from killing Xorg remotely when frozen with patch 72013 applied
possible fix
errors.log when tropics froze with patch 72794
patch 1/2
patch 2/2 |
Description
Alexandre Demers
2012-12-16 06:26:59 UTC
33e5467871b3007c4e6deea95b2cac38a55ff9f5 is the first bad commit
commit 33e5467871b3007c4e6deea95b2cac38a55ff9f5
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Mon Oct 22 12:22:39 2012 -0400

    drm/radeon: use DMA engine for VM page table updates on cayman/TN

    DMA engine has special packets to facilitate this and it also
    keeps the 3D engine free for other things.

    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Obviously, this is on a 6950 (Cayman).

Can you get the dmesg output when the lockup happens?

I'm hitting another bug right now (bug 58655). I'll apply the proposed patch for that bug and see what I get then.

Created attachment 72013
possible fix

Does this patch fix the issue?

Created attachment 72014
dmesg after lockup
This is the salvaged dmesg, retrieved through an ssh connection. Sadly, I don't think there is anything useful in there.
Would it help if I increased the debug level?

(In reply to comment #5)
> Created attachment 72013
> possible fix
>
> Does this patch fix the issue?

Testing right away.

Doesn't fix it; it locks as before. Sadly, dmesg seems to lose track because of another bug introduced in 3.8-rc1. Now that I've moved to 3.8-rc1, a huge number of messages appear in errors.log and in dmesg (when run from a terminal):

...
[ 6223.054880] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514
[ 6223.054882] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054883] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054885] radeon 0000:01:00.0: GPU fault detected: 146 0x00135514
[ 6223.054887] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054889] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054891] radeon 0000:01:00.0: GPU fault detected: 146 0x00239514
[ 6223.054893] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054895] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054897] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054899] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054900] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054902] radeon 0000:01:00.0: GPU fault detected: 146 0x00136514
[ 6223.054904] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054906] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054908] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054910] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054912] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054914] radeon 0000:01:00.0: GPU fault detected: 146 0x0033a514
[ 6223.054916] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054918] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054920] radeon 0000:01:00.0: GPU fault detected: 146 0x00232514
[ 6223.054922] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054923] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054925] radeon 0000:01:00.0: GPU fault detected: 146 0x00232514
[ 6223.054927] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054930] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054932] radeon 0000:01:00.0: GPU fault detected: 146 0x0033d514
[ 6223.054934] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054936] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054938] radeon 0000:01:00.0: GPU fault detected: 146 0x0033d514
[ 6223.054940] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054942] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054944] radeon 0000:01:00.0: GPU fault detected: 146 0x00235514
[ 6223.054946] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054948] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054950] radeon 0000:01:00.0: GPU fault detected: 146 0x00235514
[ 6223.054952] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054954] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054956] radeon 0000:01:00.0: GPU fault detected: 146 0x0033e514
[ 6223.054958] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054960] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054962] radeon 0000:01:00.0: GPU fault detected: 146 0x0033e514
[ 6223.054963] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054965] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054967] radeon 0000:01:00.0: GPU fault detected: 146 0x00339514
[ 6223.054969] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054971] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054973] radeon 0000:01:00.0: GPU fault detected: 146 0x00339514
[ 6223.054975] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054977] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6223.054979] radeon 0000:01:00.0: GPU fault detected: 146 0x00236514
[ 6223.054980] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6223.054982] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
...

I'm sure it is a different bug; can you confirm? dmesg.log stops being populated when X/GNOME starts, but running dmesg in a terminal outputs tons of the reported messages. Should I open a new bug for it, or has it already been reported?

(In reply to comment #9)
> Doesn't fix it, it locks as before. Sadly, dmesg seems to loose the count
> because of another bug introduced in 3.8-rc1. [...]
> Should I open a new bug for it or has it been already reported?

Found something similar; it looks like bug 58667.

I know there is a CS error: a message appears in the terminal just when the lock happens, telling me to check dmesg. That is the only thing I can confirm for now, because bug 58667 floods my logs.

Just to let you know, commit http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935 didn't solve the issue for this bug.

Is there anything in the kernel log when this happens now that the Mesa fix is applied? Also, does the patch in attachment 72013 help now that the Mesa side is fixed?

(In reply to comment #13)
> Is there anything in the kernel log when this happens now that the mesa fix
> is applied? Also does the patch in attachment 72013 help
> now that the mesa side is fixed?

With or without the patch, it still ends with a message saying the kernel rejected the CS and to check dmesg. Then it freezes. However, accessing the machine through ssh, I could not get anything from it. I killed Xorg remotely; the screen blinked for a moment and only garbage was displayed. I was able to retrieve something from dmesg. I killed it a second time, only to get some different garbage. I'll attach the file right away.

Created attachment 72694
dmesg from killing Xorg remotely when frozen with patch 72013 applied
This is with patch 72013 applied.
Does a 3.8 kernel work OK if you revert Mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a? I think
r600g: rework flusing and synchronization pattern v7
http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a
may be problematic on cayman.

(In reply to comment #16)
> Does a 3.8 kernel it work ok if you revert mesa back to
> cf5632094ba0c19d570ea47025cf6da75ef8457a?
> [...]
> may be problematic on cayman.

If it is problematic, it's not the cause of this bug. I went back to cf563, applied a fix for glcpp, reloaded the libraries, and it still locks at the same point, even after rebooting.

Created attachment 72794
possible fix

Does this kernel patch help?

(In reply to comment #18)
> Does this kernel patch help?

No. I was able to catch something in errors.log and kernel.log, though. I'm attaching the truncated file in a few seconds. I hit a GPU fault.

I'll do the same test without the patch to find out whether it is related or not.

Created attachment 72822
errors.log when tropics froze with patch 72794
It was originally about 53 MB, since messages kept being written until I hit the reset button. It was the same messages over and over, so I truncated it.
The same messages were recorded in everything.log and kernel.log, with no earlier error messages.
(In reply to comment #19)
> (In reply to comment #18)
> > Does this kernel patch help?
>
> No. I was able to catch something in errors.log and kernel.log though. [...]
> I'll do the same test without the patch to know if it is related or not.

Just to let you know, it does the same thing whether or not I apply the patch, even with today's latest kernel from git. I just have to get ready to launch Tropics, connect through ssh from my tablet, launch Tropics, and when it freezes, run dmesg from the tablet. Then the GPU faults get logged in my various log files.

Created attachment 74014
patch 1/2

Does this set of patches fix the issue? I think we are running out of ring space for large VM page table updates, since the DMA ring is smaller than the CP ring.

Created attachment 74015
patch 2/2

It fixes the thing! Good work! I've let it run for some time and it ran without any locks.

I've switched back to the CP for 3.8, and 3.9 will contain the new patch.

A patch referencing this bug report has been merged in Linux v3.8-rc7:

commit 3646e4209f2bd0d09022ed792e594fb4f559b86c
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Thu Jan 31 16:19:19 2013 -0500

    drm/radeon: switch back to the CP ring for VM PT updates