Bug 93015 - Tonga Elemental segfault + VM faults since radeon: implement r600_query_hw_get_result via function pointers
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2015-11-19 14:45 UTC by Andy Furniss
Modified: 2015-11-20 21:48 UTC

Attachments
patch that should fix the bug (2.21 KB, patch)
2015-11-20 13:01 UTC, Nicolai Hähnle
related patch (4.42 KB, patch)
2015-11-20 13:03 UTC, Nicolai Hähnle

Description Andy Furniss 2015-11-19 14:45:28 UTC
Unreal 4.5 Elemental demo on an R9 285 (Tonga), using a powerplay kernel.

Since this Mesa commit:

commit 50f0f938e3a577647fdfb6bdbb4ad3da252aa791
Author: Nicolai Hähnle <nhaehnle@gmail.com>
Date:   Fri Nov 13 00:27:34 2015 +0100

    radeon: implement r600_query_hw_get_result via function pointers
    
    We will need the clear_result override for the batch query implementation.

About a minute into the demo (always at the same place), it catches a segfault and quits.

In dmesg I see a few VM faults.

While confirming the bisect, I saw that the commit before the one above does not crash, but it still misbehaves:

commit c207c55fc08a1bf3dd40e79b3aaec34afbee2e55
Author: Nicolai Hähnle <nhaehnle@gmail.com>
Date:   Wed Nov 18 12:05:11 2015 +0100

    radeon: split hw query buffer handling from cs emit
    
    The idea here is that driver queries implemented outside of common code
    will use the same query buffer handling with different logic for starting
    and stopping the corresponding counters.

At the point where it would have crashed, I start getting flooded with VM faults:

[17771.298259] VM fault (0x14, vmid 5) at page 1204016, write from 'TC0' (0x54433000) (8)
[17771.330661] amdgpu 0000:01:00.0: GPU fault detected: 146 0x04c20814
[17771.330665] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00125E98
[17771.330666] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.330668] VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)
[17771.363320] amdgpu 0000:01:00.0: GPU fault detected: 146 0x05e20814
[17771.363323] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001264BC
[17771.363325] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.363326] VM fault (0x14, vmid 5) at page 1205436, write from 'TC0' (0x54433000) (8)
[17771.395828] amdgpu 0000:01:00.0: GPU fault detected: 146 0x06620814
[17771.395832] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001260CC
[17771.395833] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.395834] VM fault (0x14, vmid 5) at page 1204428, write from 'TC0' (0x54433000) (8)
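Reading the excerpt above, each VM_CONTEXT1_PROTECTION_FAULT_ADDR value, converted to decimal, equals the page number in the "VM fault" line that follows it (0x00125E98 is 1203864), and the hex client id 0x54433000 is the ASCII text 'TC0' padded with a zero byte. The following is a minimal sketch for pulling those fields out of such lines. The regular expression is based only on the lines quoted here, and the helper names `parse_vm_fault` and `client_tag` are hypothetical; they are not part of any amdgpu tooling.

```python
import re

# Pattern derived from the log lines quoted in this report, e.g.
# "VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)"
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\)"
)


def parse_vm_fault(line):
    """Return the fields of a 'VM fault' summary line, or None if no match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "vmid": int(m.group(2)),
        "page": int(m.group(3)),
        "access": m.group(4),
        "client": m.group(5),
        "client_id": int(m.group(6), 16),
    }


def client_tag(client_id):
    """Decode a 32-bit client id as big-endian ASCII, dropping NUL padding."""
    return client_id.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


line = ("[17771.330668] VM fault (0x14, vmid 5) at page 1203864, "
        "write from 'TC0' (0x54433000) (8)")
fault = parse_vm_fault(line)
print(fault["page"] == 0x00125E98)      # True: matches the FAULT_ADDR register
print(client_tag(fault["client_id"]))   # TC0
```

Grouping the parsed faults by `client` and `access` is one quick way to confirm that a flood like this one consists entirely of writes from the same block.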
Comment 1 Michel Dänzer 2015-11-20 00:29:59 UTC
Nicolai, any ideas?
Comment 2 Nicolai Hähnle 2015-11-20 10:23:32 UTC
Hi Andy, thanks for the report! I can reproduce the crash, and it does indeed seem to be related to buffer handling. I am investigating.
Comment 3 Nicolai Hähnle 2015-11-20 13:01:41 UTC
Created attachment 119980
patch that should fix the bug
Comment 4 Nicolai Hähnle 2015-11-20 13:03:40 UTC
Created attachment 119981
related patch

Okay, so I understand what failed and why it worked before.

Could you please test both patches? The first one should fix your problem, the second one is a related cleanup on top of it that hopefully contains no regressions.

[I have apparently unrelated weirdness going on right now which prevents me from testing this properly.]
Comment 5 Mathias Tillman 2015-11-20 15:13:14 UTC
Had this problem too, and the patch seems to have fixed it for me.
Comment 6 Andy Furniss 2015-11-20 16:51:44 UTC
Patch one fixes it for me and I can't find any regressions with patch one + patch two.
Comment 7 Nicolai Hähnle 2015-11-20 21:48:37 UTC
Thanks for testing! The patches are in Mesa master, and the bug doesn't affect any of the stable releases, so I am closing it.

