Bug 102500 - [polaris10, vega10][amd-staging-4.12, amd-staging-drm-next] GPU fault detected, sometimes lockup
Summary: [polaris10, vega10][amd-staging-4.12, amd-staging-drm-next] GPU fault detecte...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 102598
Depends on:
Blocks:
 
Reported: 2017-08-31 20:44 UTC by Arek Ruśniak
Modified: 2017-10-03 23:13 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg - start gnome 3 session only (73.19 KB, text/plain)
2017-08-31 20:44 UTC, Arek Ruśniak
dmesg for first bad commit (69.15 KB, text/plain)
2017-09-01 10:04 UTC, Arek Ruśniak
Possible fix (3.83 KB, patch)
2017-09-08 12:19 UTC, Christian König

Description Arek Ruśniak 2017-08-31 20:44:45 UTC
Created attachment 133914 [details]
dmesg - start gnome 3 session only

Hi, 
After today's kernel update, The Witcher 3 hangs. After a restart (with only a GNOME 3 session running), the kernel log shows a lot of:
...
amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d023d14
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001061A0
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1073568, write from 'SDM1' (0x53444d31) (61)
...
Downgrading the kernel resolves this issue.
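The numbers in the fault message above can be cross-checked by hand. The following is a minimal sketch, not driver code: the bit positions used for VM_CONTEXT1_PROTECTION_FAULT_STATUS are an assumption, chosen because they reproduce the values amdgpu printed (0x14, vmid 4, write, client 61), and the helper names are made up for illustration.

```python
def decode_fault_status(status):
    # Assumed field layout; it matches the printed log line but is not
    # taken from the register specification.
    return {
        "protections": status & 0xFF,         # bits 7:0
        "client_id": (status >> 12) & 0xFF,   # bits 19:12
        "write": bool((status >> 24) & 0x1),  # bit 24
        "vmid": (status >> 25) & 0xF,         # bits 28:25
    }


def decode_client_tag(tag):
    # The 32-bit tag holds four ASCII bytes, most significant byte first.
    return tag.to_bytes(4, "big").decode("ascii")


status = decode_fault_status(0x0903D014)
page = 0x001061A0                    # VM_CONTEXT1_PROTECTION_FAULT_ADDR
tag = decode_client_tag(0x53444D31)
print(status, page, tag)
```

Under these assumptions the fault address register is the decimal page number in the "VM fault" line written in hex, and the client tag is the name 'SDM1' packed into one word.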

I use binary kernels from an unofficial Arch repo:
linux-amd-staging 4.12.0.680862.7be0a528b097 - the bad one 
linux-amd-staging-4.12.0.680853.eeb9985d7228 - works great

These kernels are built from Alex's git tree, branch amd-staging-4.12.

OpenGL renderer string: AMD Radeon (TM) RX 470 Graphics (POLARIS10 / DRM 3.18.0 / 4.13.0-rc7-mainline, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.3.0-devel (git-2d93b462b4)
Comment 1 Alex Deucher 2017-08-31 21:58:05 UTC
can you bisect?
Comment 2 Arek Ruśniak 2017-09-01 09:55:19 UTC
OK. Sometimes the GNOME session refuses to start, or the machine even crashes on it (LEDs blinking on my keyboard).

1753d85bc82849deeb68cb5d7883207f0acbddc4 is the first bad commit
commit 1753d85bc82849deeb68cb5d7883207f0acbddc4
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

:040000 040000 fb4af6a5aa54bac7afddeb83db09105ce7dab3e5 b30f9c48abef6ddd2bd3a34dd447c0543ca1b29e M	drivers
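For readers unfamiliar with bisecting, the search that produces a "first bad commit" result like the one above is a binary search over the history. The sketch below only illustrates that idea; it is not how git implements it. The two endpoint hashes are the good and bad package revisions named in the description, while the commits in between and the position of the culprit are invented.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and everything after the first bad commit is
    also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history between the good and the bad build.
history = ["eeb9985d7228", "c1", "c2", "c3", "c4", "7be0a528b097"]
culprit_index = 3  # made up for the example
tested = []


def is_bad(commit):
    # Stands in for "build this kernel, boot it, check dmesg for GPU faults".
    tested.append(commit)
    return history.index(commit) >= culprit_index


result = first_bad(history, is_bad)
print(result, tested)
```

Each test halves the remaining range, which is why only a few kernel builds are needed even for a long list of commits.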
Comment 3 Arek Ruśniak 2017-09-01 10:04:34 UTC
Created attachment 133923 [details]
dmesg for first bad commit

Autostart through GDM into gnome3-session failed, so I booted into multi-user (no VM faults yet) and then started Fluxbox from a tty.
Comment 4 Dieter Nützel 2017-09-01 15:37:10 UTC
(In reply to Alex Deucher from comment #1)
> can you bisect?

Hello all,

I get the same on 'amd-staging-drm-next' since the Sep 1 update (kernel build time: Sep 1, 02:14 CEST), too. I will bisect this evening.

[  262.462941] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0a023d14
[  262.462946] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101540
[  262.462949] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
[  262.462952] amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1054016, write from 'SDM1' (0x53444d31) (61)
Comment 5 Dieter Nützel 2017-09-01 21:36:43 UTC
(In reply to Dieter Nützel from comment #4)

Yes,

git revert fd8bf087dffc

commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

solves it on 'amd-staging-drm-next', too.
Comment 6 Arek Ruśniak 2017-09-02 07:37:37 UTC
Today's build is fine. There are no VM faults anymore.
Thanks for the fix.
Dieter, could you confirm that for the staging-next tree?
Comment 7 Christian König 2017-09-02 07:38:25 UTC
Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?
Comment 8 Arek Ruśniak 2017-09-02 10:30:49 UTC
Something went wrong in my build environment, or I just got lucky with the earlier test. The issue is not fixed.
"GPU fault detected" is still happening.

Sorry for the inconvenience.
Comment 9 Vedran Miletić 2017-09-03 16:34:24 UTC
(In reply to Dieter Nützel from comment #5)

Confirmed fixed by reverting on Vega 10.
Comment 10 Arek Ruśniak 2017-09-07 20:13:32 UTC
Additional info:
I tried to figure out why everything went OK in my earlier test, and Mesa is probably the trigger:

Linux-amd-staging + Mesa-git + LLVM-svn - failure
Linux-amd-staging + Mesa-git + LLVM 4.0.1 - failure
Linux-amd-staging + Mesa 17.1.8 + LLVM 4.0.1 - works ok. 
I will try some bisecting later; we will see.
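The reasoning behind "Mesa is probably the trigger" can be written down as a small check over the three combinations listed above. This is only a sketch of that elimination step; the function name and the data layout are made up for illustration.

```python
# The three combinations from the list above; True means "works OK".
results = [
    ({"kernel": "linux-amd-staging", "mesa": "git", "llvm": "svn"}, False),
    ({"kernel": "linux-amd-staging", "mesa": "git", "llvm": "4.0.1"}, False),
    ({"kernel": "linux-amd-staging", "mesa": "17.1.8", "llvm": "4.0.1"}, True),
]


def suspects(results):
    """Components that are the only difference between a working and a failing run."""
    found = set()
    for cfg_ok, ok in results:
        for cfg_fail, other_ok in results:
            if ok and not other_ok:
                diff = [k for k in cfg_ok if cfg_ok[k] != cfg_fail[k]]
                if len(diff) == 1:
                    found.add(diff[0])
    return found


print(suspects(results))
```

The second and third rows differ only in the Mesa version, while the first and second rows show that changing LLVM alone does not change the outcome.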
Comment 11 Arek Ruśniak 2017-09-07 23:23:16 UTC
On the Mesa side, it looks like this is it:

214b565bc28bc4419f3eec29ab7bbe34080459fe is the first bad commit
commit 214b565bc28bc4419f3eec29ab7bbe34080459fe
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Aug 29 16:45:46 2017 +0200

    winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2
    
    When the kernel supports it set the local flag and
    stop adding those BOs to the BO list.
    
    Can probably be optimized much more.
    
    v2: rename new flag to AMDGPU_GEM_CREATE_VM_ALWAYS_VALID
    
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

:040000 040000 2e4b2737f37ede2bbdbbe6815fe0fa562177c2b7 3482c86ed92116adff7ab12b2d4de870746a1df6 M	src
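The behaviour described in that commit message (set the flag when the kernel supports it, and stop adding those BOs to the BO list) can be sketched roughly as follows. This is an illustration and not Mesa's code: the Winsys class, the create_bo helper, the may_be_shared condition and the numeric flag value are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Placeholder value for illustration; the real constant is defined in the
# kernel's amdgpu UAPI header.
VM_ALWAYS_VALID = 1 << 6


@dataclass
class Winsys:
    kernel_has_local_bos: bool                   # set from the DRM version bump
    bo_list: list = field(default_factory=list)  # BOs sent with each submission


def create_bo(ws, name, may_be_shared):
    flags = 0
    # Assumption: only buffers private to this process qualify ("if possible").
    if ws.kernel_has_local_bos and not may_be_shared:
        flags |= VM_ALWAYS_VALID
    bo = {"name": name, "flags": flags}
    if not flags & VM_ALWAYS_VALID:
        # Without the flag the BO still has to be listed for every submission.
        ws.bo_list.append(bo)
    return bo


new_kernel = Winsys(kernel_has_local_bos=True)
old_kernel = Winsys(kernel_has_local_bos=False)
create_bo(new_kernel, "texture", may_be_shared=False)
create_bo(new_kernel, "scanout", may_be_shared=True)
create_bo(old_kernel, "texture", may_be_shared=False)
print(len(new_kernel.bo_list), len(old_kernel.bo_list))
```

This also shows why the problem only appears with both halves present: the userspace path is taken only once the kernel advertises support through its version bump, which is the commit the kernel bisect pointed at.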
Comment 12 Christian König 2017-09-08 08:18:06 UTC
To repeat my question: Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?

Do you guys have this in your kernel branch yet? If not that lockup is expected.
Comment 13 Arek Ruśniak 2017-09-08 10:13:59 UTC
Christian sorry, I thought that was clear. 
Yes, I updated ASAP, so my kernel contains:
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-4.12&id=8bd2cc0ab44b00346cc41f3ac828cbf992f6bc61
It doesn't help with the VM faults.

Every test right before and after your comment was done on:
linux-amd-staging-4.12-c5def4cbdb61
Comment 14 Christian König 2017-09-08 11:08:40 UTC
(In reply to Arek Ruśniak from comment #13)
> Christian sorry, I thought that was clear. 

No problem, that just means that this is the same issue I'm still hunting for.
Comment 15 Christian König 2017-09-08 12:19:35 UTC
Created attachment 134082 [details] [review]
Possible fix

Please try the attached kernel patch.
Comment 16 Arek Ruśniak 2017-09-08 15:54:38 UTC
The patch fixes the issue.
I've tried both staging-4.12 and staging-drm-next branches.
Thanks Christian

PS. It would be nice if Vedran could confirm this for Vega before we close.
Comment 17 Dieter Nützel 2017-09-09 01:49:03 UTC
(In reply to Christian König from comment #15)
> Created attachment 134082 [details] [review]
> Possible fix
> 
> Please try the attached kernel patch.

Hello Christian,

you've done your 'homework'... ;-)

> To repeat my question: Does patch "drm/amdgpu: fix moved list handling in
> the VM" fix the issue?
>
> Do you guys have this in your kernel branch yet? If not that lockup is
> expected.

No, I haven't.
It fell through the cracks of the repeated DC rebases of Alex's 'amd-staging-drm-next' tree (I didn't notice it for the last 7 days, during Alex's vacation).
I'll make it short: NO, that didn't solve it for me, either.

But _this_ patch is GOLD:
drm-amdgpu-fix-VM-sync-with-always-valid-BOs.mbox

Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>

Best 'glmark2' Score I've ever seen.
RX580, 8 GB
Xeon X3470, 4/8, 3 GHz
24 GB

glmark2 Score: 6428

With additional load on the gfx cores from running 'opencl-example/run_tests.sh' in parallel, I got

glmark2 Score: 7574

Good job!
Comment 18 Vedran Miletić 2017-09-09 19:11:34 UTC
(In reply to Arek Ruśniak from comment #16)
> Patch fixes issue.
> I've tried both staging-4.12 and staging-drm-next branches.
> Thanks Christian
> 
> PS. It would be nice if Vedran could confirm this for Vega before we close.

I can confirm that after applying the patch the issue doesn't occur for me. (I hope that's enough, I can't claim more than that since I have done 2-3 upgrades of mesa/llvm since I last tested the broken kernel.)
Comment 19 charlie 2017-09-09 19:52:20 UTC
Bug 102500 might be related to bug 102598.

I tried to apply the patch (attachment 134082) to the amd-staging-4.12 (~agd5f/linux) kernel and to drm-next-4.15-wip, but it does not apply cleanly. I applied it manually to drm-next-4.15-wip, and that kernel would not finish compiling.
Comment 20 charlie 2017-09-10 04:14:13 UTC
I confirm that bug 102500 and bug 102598 are the same.

I split up the patch into 3 parts and they applied cleanly with offsets to drm-next-4.15-wip.

I then reverted Mesa to commit 214b565bc28bc4419f3eec29ab7bbe34080459fe (winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2), compiled, and started X; the corruption and lockups are gone.
Comment 21 charlie 2017-09-10 04:21:50 UTC
*** Bug 102598 has been marked as a duplicate of this bug. ***
Comment 22 Vedran Miletić 2017-10-03 21:36:17 UTC
The patch has been included in amd-staging-drm-next for a while. Should this bug be closed?

