Bug 93015

Summary:

Tonga Elemental segfault + VM faults since radeon: implement r600_query_hw_get_result via function pointers

Product:

DRI

Reporter:

Andy Furniss <adf.lists>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

Severity:

normal

Priority:

medium

CC:

nhaehnle

Version:

DRI git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
patch that should fix the bug	none
related patch	none

Description Andy Furniss 2015-11-19 14:45:28 UTC

Unreal 4.5 Elemental demo on an R9 285 using a powerplay kernel.

Since Mesa commit:

commit 50f0f938e3a577647fdfb6bdbb4ad3da252aa791
Author: Nicolai Hähnle <nhaehnle@gmail.com>
Date:   Fri Nov 13 00:27:34 2015 +0100

    radeon: implement r600_query_hw_get_result via function pointers
    
    We will need the clear_result override for the batch query implementation.

About a minute into the demo (always at the same place), the demo catches a segfault and quits.

In dmesg I see a few VM faults.

While confirming the bisect, I noticed that the commit before the one above does not crash, but it is not clean either:

commit c207c55fc08a1bf3dd40e79b3aaec34afbee2e55
Author: Nicolai Hähnle <nhaehnle@gmail.com>
Date:   Wed Nov 18 12:05:11 2015 +0100

    radeon: split hw query buffer handling from cs emit
    
    The idea here is that driver queries implemented outside of common code
    will use the same query buffer handling with different logic for starting
    and stopping the corresponding counters.

With that commit, at the point where it would have crashed, I start getting flooded with VM faults:

[17771.298259] VM fault (0x14, vmid 5) at page 1204016, write from 'TC0' (0x54433000) (8)
[17771.330661] amdgpu 0000:01:00.0: GPU fault detected: 146 0x04c20814
[17771.330665] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00125E98
[17771.330666] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.330668] VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)
[17771.363320] amdgpu 0000:01:00.0: GPU fault detected: 146 0x05e20814
[17771.363323] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001264BC
[17771.363325] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.363326] VM fault (0x14, vmid 5) at page 1205436, write from 'TC0' (0x54433000) (8)
[17771.395828] amdgpu 0000:01:00.0: GPU fault detected: 146 0x06620814
[17771.395832] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001260CC
[17771.395833] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.395834] VM fault (0x14, vmid 5) at page 1204428, write from 'TC0' (0x54433000) (8)
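
For anyone reading the log above, here is a minimal Python sketch that pairs each "VM fault" line with the VM_CONTEXT1_PROTECTION_FAULT_ADDR line preceding it. It is not part of the original report. The regular expressions are written only against the line format shown here, and the helper names (parse_faults, decode_client) are hypothetical. Two observations it relies on, both taken from this log rather than from driver documentation: the hexadecimal ADDR value equals the decimal page number on the following fault line (0x00125E98 is 1203864), and the 32-bit client ID 0x54433000 is the ASCII string 'TC0' padded with a NUL byte.

```python
import re

# Sample lines copied from the dmesg output in this report.
LOG = """\
[17771.330665] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00125E98
[17771.330666] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008014
[17771.330668] VM fault (0x14, vmid 5) at page 1203864, write from 'TC0' (0x54433000) (8)
"""

# Patterns assume exactly the line format seen above (an assumption, not a spec).
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
FAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-f]+, vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \(0x([0-9a-fA-F]{8})\)"
)


def decode_client(hex_id):
    """Read the 32-bit client ID as ASCII bytes, dropping trailing NUL padding."""
    return bytes.fromhex(hex_id).rstrip(b"\x00").decode("ascii")


def parse_faults(text):
    """Return one dict per 'VM fault' line, with the preceding ADDR value if present."""
    faults = []
    last_addr = None
    for line in text.splitlines():
        m = ADDR_RE.search(line)
        if m:
            last_addr = int(m.group(1), 16)
            continue
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "vmid": int(m.group(1)),
                "page": int(m.group(2)),
                "access": m.group(3),
                "client": m.group(4),
                "client_from_id": decode_client(m.group(5)),
                "addr_register": last_addr,
            })
            last_addr = None  # each ADDR line belongs to one fault line
    return faults


if __name__ == "__main__":
    for fault in parse_faults(LOG):
        print(fault)
```

Feeding a full dmesg capture through parse_faults and counting the distinct pages and clients is a quick way to check whether every fault is a write from the same client, as appears to be the case in the excerpt above.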

Comment 1 Michel Dänzer 2015-11-20 00:29:59 UTC

Nicolai, any ideas?

Comment 2 Nicolai Hähnle 2015-11-20 10:23:32 UTC

Hi Andy, thanks for the report! I can reproduce the crash, and it does indeed seem to be related to buffer handling. I am investigating.

Comment 3 Nicolai Hähnle 2015-11-20 13:01:41 UTC

Created attachment 119980
patch that should fix the bug

Comment 4 Nicolai Hähnle 2015-11-20 13:03:40 UTC

Created attachment 119981
related patch

Okay, so I understand what failed and why it worked before.

Could you please test both patches? The first one should fix your problem; the second is a related cleanup on top of it that hopefully introduces no regressions.

[I have apparently unrelated weirdness going on right now which prevents me from testing this properly.]

Comment 5 Mathias Tillman 2015-11-20 15:13:14 UTC

Had this problem too, and the patch seems to have fixed it for me.

Comment 6 Andy Furniss 2015-11-20 16:51:44 UTC

Patch one fixes it for me and I can't find any regressions with patch one + patch two.

Comment 7 Nicolai Hähnle 2015-11-20 21:48:37 UTC

Thanks for testing! The patches are in Mesa master, and the bug doesn't affect any of the stable releases, hence closing it.
