Bug 96360

Summary: [bisected: 3d02b7] VM fault with kernel 4.7-rc1 on Alien: Isolation
Product: DRI
Reporter: Alexandre Demers <alexandre.f.demers>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: alexdeucher, bas
Version: DRI git
Hardware: Other
OS: All
Attachments:
Description                                        Flags
VM faults in dmesg triggered by Alien: Isolation   none
preliminary patch                                  none

Description Alexandre Demers 2016-06-03 21:20:59 UTC
Created attachment 124312
VM faults in dmesg triggered by Alien: Isolation

Hi,

I've recently begun playing Alien: Isolation now that compute shaders are available when combining the latest Mesa with a 4.7-git kernel. However, my computer freezes after some time. Checking dmesg shows a repeating VM fault that continues for as long as I'm in the game (while playing, not in the menu or while loading).
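One way to check whether the repeating fault is the same fault each time is to count the fault reports and tally the faulting addresses. The sketch below is illustrative and is not part of the original report. The two message patterns are assumptions based on what the radeon driver typically logs for a VM protection fault: a `GPU fault detected: ...` line followed by a `VM_CONTEXT1_PROTECTION_FAULT_ADDR ...` line. The helper name `summarize_vm_faults` is hypothetical.

```python
import re
from collections import Counter

# Assumed radeon log formats; adjust these patterns if your dmesg differs.
FAULT_RE = re.compile(r"GPU fault detected:\s*(\d+)\s+(0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+(0x[0-9a-fA-F]+)")


def summarize_vm_faults(dmesg_text):
    """Return (number of fault reports, Counter of faulting addresses)."""
    faults = 0
    addrs = Counter()
    for line in dmesg_text.splitlines():
        if FAULT_RE.search(line):
            faults += 1
        m = ADDR_RE.search(line)
        if m:
            # Normalise case so 0x0010A000 and 0x0010a000 are tallied together.
            addrs[m.group(1).lower()] += 1
    return faults, addrs
```

As an example, `summarize_vm_faults(subprocess.run(["dmesg"], capture_output=True, text=True).stdout)` would summarise the live kernel log. This assumes the user is allowed to read it, since some distributions restrict `dmesg` to root. If most faults land on one address, that points to a single bad access that repeats, rather than to scattered corruption.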

Setup:
GPU -> R9 280X
Kernel -> 4.7-rc1
Distribution -> Archlinux 64
Mesa, drm, ddx -> latest code from the git repositories

I'll try with a 4.6 kernel to see if this bug was introduced in or exposed by the 4.7 branch. 

Attaching dmesg
Comment 1 Alexandre Demers 2016-06-05 01:13:32 UTC
Bisecting identified this as the commit that exposes the bug:

3d02b7fee9c3ece1746f5b06c4143b511383fc6b is the first bad commit
commit 3d02b7fee9c3ece1746f5b06c4143b511383fc6b
Author: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Date:   Fri Apr 15 02:47:49 2016 +0200

    drm/radeon: Allow setting shader registers using DMA/COPY packet3 on SI.
    
    Mesa uses a COPY_DATA packet to copy the grid size for indirect dispatches
    into COMPUTE_USER_DATA_*.
    
    Setting those registers with a SET_SH_REG packet is allowed, not allowing
    them with other packets seems like an oversight.
    
    v2: Clarify commit message.
    
    Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 ec4d58f4c4e5bd746474776c852a7c70763183a7 8ba67a00f061b454d1600838bf620e1e937e9461 M	drivers
Comment 2 Alexandre Demers 2016-06-08 05:34:28 UTC
Still a problem with today's latest mesa and kernel 4.7-rc2.
Comment 3 Nicolai Hähnle 2016-06-08 09:38:18 UTC
Can you provide an apitrace that reproduces the VM faults?
Comment 4 Alexandre Demers 2016-06-08 15:54:52 UTC
(In reply to Nicolai Hähnle from comment #3)
> Can you provide an apitrace that reproduces the VM faults?

I'll do that later. However, the apitrace may end up being large, since it takes a long time before I can actually play the game, which is when the VM faults appear. I'll figure out a way to share a link.
Comment 5 Nicolai Hähnle 2016-06-08 16:48:42 UTC
Thanks, that would be much appreciated. Most people tend to use Google Drive - I've downloaded GB-sized traces from there.
Comment 6 Alexandre Demers 2016-06-09 16:04:10 UTC
Here is the trace, which you can download from Google Drive (I hope the link works):
https://drive.google.com/open?id=0Bw_tZdWsNa4BV0VKMGVaeFBDaEE
Comment 7 Nicolai Hähnle 2016-06-10 11:22:01 UTC
Thanks! No VM faults here on Tonga, so this may be specific to SI. Do you get VM faults in dmesg when you play the trace back on your system?
Comment 8 Alexandre Demers 2016-06-10 13:35:45 UTC
(In reply to Nicolai Hähnle from comment #7)
> Thanks! No VM faults here on Tonga, so this may be specific to SI. Do you
> get VM faults in dmesg when you play the trace back on your system?

Yes, I do. I just tested it with the trace I shared with you.
Comment 9 Alexandre Demers 2016-06-13 05:32:59 UTC
(In reply to Nicolai Hähnle from comment #7)
> Thanks! No VM faults here on Tonga, so this may be specific to SI. Do you
> get VM faults in dmesg when you play the trace back on your system?

Anything I can provide you with? Any specific test or steps?
Comment 10 Nicolai Hähnle 2016-06-13 11:04:07 UTC
No, thank you. I can reproduce this on a Verde, so it does seem to be SI-specific (perhaps also CI) or perhaps a radeon vs. amdgpu issue.

Even with GALLIUM_DDEBUG=800 (i.e. frequent flushes), I've seen a VM fault happen at:

901138 @3 glDispatchCompute(num_groups_x = 128, num_groups_y = 2, num_groups_z = 1)
Comment 11 Nicolai Hähnle 2016-06-13 21:06:12 UTC
Created attachment 124514
preliminary patch

Problem understood - we're generating bad shader code - though I still need to double-check all the possible corner cases.

In the meantime, the attached patch for LLVM should fix Alien: Isolation.
Comment 12 Nicolai Hähnle 2016-06-15 07:20:46 UTC
Fixed in LLVM r272761.
Comment 13 Alexandre Demers 2016-06-15 16:50:05 UTC
(In reply to Nicolai Hähnle from comment #12)
> Fixed in LLVM r272761.

Thank you, I'll test it later today.
