|Summary:||Tonga Elemental segfault + VM faults since radeon: implement r600_query_hw_get_result via function pointers|
|Product:||DRI||Reporter:||Andy Furniss <adf.lists>|
|Component:||DRM/AMDgpu||Assignee:||Default DRI bug account <dri-devel>|
|Status:||RESOLVED FIXED||QA Contact:|
Description Andy Furniss 2015-11-19 14:45:28 UTC
Unreal 4.5 Elemental demo on r9 285 using powerplay kernel.

Since mesa commit -

commit 50f0f938e3a577647fdfb6bdbb4ad3da252aa791
Author: Nicolai Hähnle <email@example.com>
Date:   Fri Nov 13 00:27:34 2015 +0100

    radeon: implement r600_query_hw_get_result via function pointers

    We will need the clear_result override for the batch query implementation.

About a minute into the demo (always same place) the demo will catch a segfault and quit. In dmesg I see a few VM faults.

While confirming the bisect I see that though it doesn't crash on the commit before above =

commit c207c55fc08a1bf3dd40e79b3aaec34afbee2e55
Author: Nicolai Hähnle <firstname.lastname@example.org>
Date:   Wed Nov 18 12:05:11 2015 +0100

    radeon: split hw query buffer handling from cs emit

    The idea here is that driver queries implemented outside of common code
    will use the same query buffer handling with different logic for starting
    and stopping the corresponding counters.

At the point where it would have crashed I start getting flooded with VM faults

[17771.298259] VM fault (0x14, vmid 5) at page 1204016, write from 'TC0' (0x54433000) (8)
[17771.330661] amdgpu 0000:01:00.0: GPU fault detected: 146 0x04c20814
[17771.330665] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00125E98
[17771.330666] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.330668] VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)
[17771.363320] amdgpu 0000:01:00.0: GPU fault detected: 146 0x05e20814
[17771.363323] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001264BC
[17771.363325] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.363326] VM fault (0x14, vmid 5) at page 1205436, write from 'TC0' (0x54433000) (8)
[17771.395828] amdgpu 0000:01:00.0: GPU fault detected: 146 0x06620814
[17771.395832] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001260CC
[17771.395833] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.395834] VM fault (0x14, vmid 5) at page 1204428, write from 'TC0' (0x54433000) (8)
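
For anyone triaging a flood like the one above, a small script can help summarize the "VM fault" lines (which pages, which client, read or write). The following is only a sketch: the regular expression is inferred from the log lines quoted in this report, not from the amdgpu source, and the names `VM_FAULT_RE` and `parse_vm_faults` are made up for illustration.

```python
import re

# Pattern inferred from the "VM fault" lines quoted in this report only;
# other kernels or drivers may format these messages differently.
VM_FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+VM fault \(0x[0-9a-fA-F]+, vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from '(?P<client>[^']+)'"
)


def parse_vm_faults(log_text):
    """Return one dict per 'VM fault' line found in log_text (hypothetical helper)."""
    faults = []
    for m in VM_FAULT_RE.finditer(log_text):
        faults.append({
            "timestamp": float(m.group("ts")),
            "vmid": int(m.group("vmid")),
            "page": int(m.group("page")),
            "access": m.group("access"),
            "client": m.group("client"),
        })
    return faults


# Two lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[17771.298259] VM fault (0x14, vmid 5) at page 1204016, write from 'TC0' (0x54433000) (8)
[17771.330668] VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)
"""

faults = parse_vm_faults(SAMPLE)
for fault in faults:
    # Print the page in hex as well, for comparison with the register dump lines.
    print(fault["timestamp"], fault["client"], fault["access"], hex(fault["page"]))
```

In the excerpt, the page printed in hex for the second sample line (0x125e98) matches the VM_CONTEXT1_PROTECTION_FAULT_ADDR value logged just before it, which suggests each "VM fault" line summarizes the register dump that precedes it.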
Comment 1 Michel Dänzer 2015-11-20 00:29:59 UTC
Nicolai, any ideas?
Comment 2 Nicolai Hähnle 2015-11-20 10:23:32 UTC
Hi Andy, thanks for the report! I can reproduce the crash, and it does indeed seem to be related to buffer handling. I am investigating.
Comment 3 Nicolai Hähnle 2015-11-20 13:01:41 UTC
Created attachment 119980: patch that should fix the bug
Comment 4 Nicolai Hähnle 2015-11-20 13:03:40 UTC
Created attachment 119981: related patch

Okay, so I understand what failed and why it worked before. Could you please test both patches? The first one should fix your problem, the second one is a related cleanup on top of it that hopefully contains no regressions.

[I have apparently unrelated weirdness going on right now which prevents me from testing this properly.]
Comment 5 Mathias Tillman 2015-11-20 15:13:14 UTC
Had this problem too, and the patch seems to have fixed it for me.
Comment 6 Andy Furniss 2015-11-20 16:51:44 UTC
Patch one fixes it for me and I can't find any regressions with patch one + patch two.
Comment 7 Nicolai Hähnle 2015-11-20 21:48:37 UTC
Thanks for testing! The patches are in Mesa master, and the bug doesn't affect any of the stable releases, hence closing it.