Summary: | [polaris10, vega10][amd-staging-4.12, amd-staging-drm-next] GPU fault detected, sometimes lockup |
---|---|
Product: | DRI |
Component: | DRM/AMDgpu |
Status: | RESOLVED FIXED |
Severity: | normal |
Priority: | medium |
Version: | DRI git |
Hardware: | x86-64 (AMD64) |
OS: | Linux (All) |
Reporter: | Arek Ruśniak <arek.rusi> |
Assignee: | Default DRI bug account <dri-devel> |
CC: | bug0xa3d2, ckoenig.leichtzumerken, vedran |

can you bisect?

Ok, sometimes the GNOME session refuses to work or even crashes on it (LEDs blinking on my keyboard).

1753d85bc82849deeb68cb5d7883207f0acbddc4 is the first bad commit
commit 1753d85bc82849deeb68cb5d7883207f0acbddc4
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

:040000 040000 fb4af6a5aa54bac7afddeb83db09105ce7dab3e5 b30f9c48abef6ddd2bd3a34dd447c0543ca1b29e M drivers

Created attachment 133923
dmesg for first bad commit

Autostart through gdm to gnome3-session failed, so I booted into multi-user (no VM faults yet) and then started fluxbox from a tty.

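For readers following the "first bad commit" result above: the search that `git bisect` automates is a binary search over an ordered commit history. The sketch below is only an illustration of that idea, not the tool's implementation. The commit names `c0` to `c9` and the `is_bad` predicate are hypothetical stand-ins for building and booting each kernel; it assumes the oldest commit is good, the newest is bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and badness never disappears once introduced.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is known good, commits[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history: the fault first appears in "c6".
history = [f"c{i}" for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 6)
print(culprit)
```

With ten commits this needs three test builds instead of nine, which is why bisecting is practical even when each step means compiling and booting a kernel.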
(In reply to Alex Deucher from comment #1)
> can you bisect?

Hello all,

I get the same on 'amd-staging-drm-next' since the update of 1 Sep (kernel build time: 1 Sep, 02:14 CEST), too. Will bisect in the evening.

[ 262.462941] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0a023d14
[ 262.462946] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101540
[ 262.462949] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
[ 262.462952] amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1054016, write from 'SDM1' (0x53444d31) (61)

(In reply to Dieter Nützel from comment #4)
> (In reply to Alex Deucher from comment #1)
> > can you bisect?
>
> Hello all,
>
> I get the same on 'amd-staging-drm-next' since 1. of Sep (kernel build time:
> 1. Sep 02:14 CEST) update, too. Will go to bisect in the evening.
>
> [ 262.462941] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0a023d14
> [ 262.462946] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101540
> [ 262.462949] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
> [ 262.462952] amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1054016, write from 'SDM1' (0x53444d31) (61)

Yes,

git revert fd8bf087dffc

commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

solves it on 'amd-staging-drm-next', too.

Today's build is fine. There's no VM fault anymore. Thanks for the fix.
Dieter, could you confirm that for the staging-next tree?

Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?

Something went wrong in my build environment, or I just got lucky with the earlier test. There's no fix. "GPU fault detected" is still happening.
Sorry for the inconvenience.

(In reply to Dieter Nützel from comment #5)
> (In reply to Dieter Nützel from comment #4)
> > (In reply to Alex Deucher from comment #1)
> > > can you bisect?
> >
> > Hello all,
> >
> > I get the same on 'amd-staging-drm-next' since 1. of Sep (kernel build time:
> > 1. Sep 02:14 CEST) update, too. Will go to bisect in the evening.
> >
> > [ 262.462941] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0a023d14
> > [ 262.462946] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101540
> > [ 262.462949] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
> > [ 262.462952] amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1054016, write from 'SDM1' (0x53444d31) (61)
>
> Yes,
>
> git revert fd8bf087dffc
>
> commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
> Author: Christian König <christian.koenig@amd.com>
> Date: Tue Aug 29 16:14:32 2017 +0200
>
> drm/amdgpu: bump version for support of local BOs
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>
> Solve it on 'amd-staging-drm-next', too.

Confirmed fixed by reverting on Vega 10.

Additional info: I tried to figure out why everything went OK in my earlier test, and Mesa is probably the trigger:

Linux-amd-staging + Mesa-git + LLVM-svn - failure
Linux-amd-staging + Mesa-git + LLVM 4.0.1 - failure
Linux-amd-staging + Mesa 17.1.8 + LLVM 4.0.1 - works OK

I will try some bisecting later, we will see.

On the Mesa side, it looks like this is it:

214b565bc28bc4419f3eec29ab7bbe34080459fe is the first bad commit
commit 214b565bc28bc4419f3eec29ab7bbe34080459fe
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:45:46 2017 +0200

    winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2

    When the kernel supports it set the local flag and stop adding those BOs to the BO list.

    Can probably be optimized much more.

    v2: rename new flag to AMDGPU_GEM_CREATE_VM_ALWAYS_VALID

    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

:040000 040000 2e4b2737f37ede2bbdbbe6815fe0fa562177c2b7 3482c86ed92116adff7ab12b2d4de870746a1df6 M src

To repeat my question: Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?

Do you guys have this in your kernel branch yet? If not, that lockup is expected.

Christian, sorry, I thought that was clear. Yes, I updated ASAP, so it contains:
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-4.12&id=8bd2cc0ab44b00346cc41f3ac828cbf992f6bc61
It doesn't help with the VM faults.

Every test right before and after your comment is for: linux-amd-staging-4.12-c5def4cbdb61

(In reply to Arek Ruśniak from comment #13)
> Christian sorry, I thought that was clear.

No problem, that just means that this is the same issue I'm still hunting for.

Created attachment 134082
Possible fix

Please try the attached kernel patch.

The patch fixes the issue.
I've tried both the staging-4.12 and staging-drm-next branches.
Thanks, Christian.

PS. It would be nice if Vedran could confirm this for Vega before we close.

(In reply to Christian König from comment #15)
> Created attachment 134082
> Possible fix
>
> Please try the attached kernel patch.

Hello Christian,

you've done your 'homework'...;-)

> To repeat my question: Does patch "drm/amdgpu: fix moved list handling in
> the VM" fix the issue?
>
> Do you guys have this in your kernel branch yet? If not that lockup is
> expected.

No, I hadn't. It fell through the cracks of the repeated DC rebase of Alex's 'amd-staging-drm-next' tree (I didn't notice it for the last 7 days, Alex's vacation).

I'll make it short.
NO, that didn't solve it for me, either.

But _this_ patch is GOLD:
drm-amdgpu-fix-VM-sync-with-always-valid-BOs.mbox

Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>

Best 'glmark2' Score I've ever seen.
RX580, 8 GB
Xeon X3470, 4/8, 3 GHz
24 GB

glmark2 Score: 6428

With additional load on the gfx cores through parallel running 'opencl-example/run_tests.sh' I got
glmark2 Score: 7574

Good job!

(In reply to Arek Ruśniak from comment #16)
> Patch fixes issue.
> I've tried both staging-4.12 and staging-drm-next branches.
> Thanks Christian
>
> PS. It will be nice if Vedran could confirmed this for Vega before we close.

I can confirm that after applying the patch the issue doesn't occur for me. (I hope that's enough; I can't claim more than that, since I have done 2-3 upgrades of mesa/llvm since I last tested the broken kernel.)

Bug 102500 might be related to bug 102598. I tried to apply the patch (attachment 134082) to the amd-staging-4.12 (~agd5f/linux) kernel and to drm-next-4.15-wip, but it does not apply cleanly. I applied it manually to drm-next-4.15-wip, and that kernel would not finish compiling.

I confirm that bug 102500 and bug 102598 are the same. I split up the patch into 3 parts and they applied cleanly, with offsets, to drm-next-4.15-wip. I then reverted mesa to commit 214b565bc28bc4419f3eec29ab7bbe34080459fe (winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2), compiled and started X, and the corruption and lockups are gone.

*** Bug 102598 has been marked as a duplicate of this bug. ***

The patch has been included in amd-staging-drm-next for a while; should this bug be closed?
Created attachment 133914
dmesg - start gnome 3 session only

Hi,
After today's kernel update, Witcher 3 hangs. After a restart (only the Gnome3 session is running) I see a lot of these in the kernel log:

...
amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d023d14
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001061A0
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1073568, write from 'SDM1' (0x53444d31) (61)
...

Downgrading the kernel resolves this issue.

I use a binary kernel from an unofficial Arch repo:
linux-amd-staging 4.12.0.680862.7be0a528b097 - the bad one
linux-amd-staging-4.12.0.680853.eeb9985d7228 - works great
These kernels are built from Alex's git tree, branch amd-staging-4.12.

OpenGL renderer string: AMD Radeon (TM) RX 470 Graphics (POLARIS10 / DRM 3.18.0 / 4.13.0-rc7-mainline, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.3.0-devel (git-2d93b462b4)
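
The fault lines in this report carry redundant information that can be cross-checked. The following is a minimal sketch that parses the `VM fault` line quoted above, decodes the four-byte client tag, and compares it against the raw address and status registers. The bit positions used in `decode_status` are an assumption on my part (they are not stated anywhere in this thread), chosen because they reproduce the `0x14`, `vmid 4`, `write` and `61` values the driver printed; the function names are hypothetical.

```python
import re

# Matches the human-readable summary line printed by the driver.
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '(.{4})' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def decode_client_tag(value: int) -> str:
    """Interpret a 32-bit value as four big-endian ASCII characters."""
    return value.to_bytes(4, "big").decode("ascii")


def decode_status(status: int) -> dict:
    """Split the fault status register into fields.

    ASSUMPTION: the bit positions below are not taken from the thread;
    they are picked to be consistent with the values logged next to it.
    """
    return {
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0xFF,
        "write": bool((status >> 24) & 0x1),
        "vmid": (status >> 25) & 0xF,
    }


line = ("amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1073568, "
        "write from 'SDM1' (0x53444d31) (61)")
match = FAULT_RE.search(line)
prot, vmid, page, access, tag, tag_hex, client_id = match.groups()

fault_addr = 0x001061A0   # VM_CONTEXT1_PROTECTION_FAULT_ADDR from the log
status = 0x0903D014       # VM_CONTEXT1_PROTECTION_FAULT_STATUS from the log

print(int(page) == fault_addr)                # page number is the address in decimal
print(decode_client_tag(int(tag_hex, 16)))    # hex tag spelled out as ASCII
print(decode_status(status))
```

The same check works for the other log excerpt in this thread: `0x00101540` is 1054016 in decimal, which is the page number printed on its `VM fault` line.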