Bug 102666

Summary: amdgpu_vm_bo_invalidate NULL reference in amd-staging-drm-next
Product: DRI
Component: DRM/AMDgpu
Status: CLOSED WORKSFORME
Severity: normal
Priority: medium
Version: DRI git
Hardware: Other
OS: All
Reporter: Bas Nieuwenhuizen <bas>
Assignee: Default DRI bug account <dri-devel>
QA Contact:
Whiteboard:
i915 platform:
i915 features:
Attachments: dmesg (flags: none)

Description Bas Nieuwenhuizen 2017-09-11 22:22:57 UTC
Created attachment 134171: dmesg

I'm getting the following NULL pointer dereference

[  404.518419] BUG: unable to handle kernel NULL pointer dereference at 0000000000000220
[  404.518445] IP: amdgpu_vm_bo_invalidate+0x71/0x150 [amdgpu]

when running the Vulkan CTS with 32 processes (with the tests that cause OOM removed).

Current linux tip:

commit 2dd9dc59c1419c090b084461165bd8b0adf1fecb (HEAD -> amd-staging-drm-next, origin/amd-staging-drm-next)
Author: Harry Wentland <harry.wentland@amd.com>
Date:   Thu Aug 31 21:17:05 2017 -0400

    drm/amdgpu: Remove unused flip_flags from amdgpu_crtc


There doesn't seem to be a corresponding GPU hang: the card is clocked down and /sys/kernel/debug/dri/0/amdgpu_fence_info shows no pending fences. However, eventually some of the CTS processes get stuck, and I can't kill them, attach gdb to them, etc. It is probably a page fault that gets stuck, since fence waits don't seem to get stuck easily. Either way, I'm not sure yet whether that is related.

AFAICT the issue is that vm->root.base.bo is NULL in

if (evicted && bo->tbo.resv == vm->root.base.bo->tbo.resv) {
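The failure mode can be illustrated with a minimal userspace sketch. The struct layouts below are simplified stand-ins, not the real amdgpu/TTM definitions, and bo_shares_root_resv is a hypothetical helper that does not exist in the driver. It only shows that the comparison faults when the root BO pointer is NULL and how a guard would avoid the dereference; it is not the fix that went into the kernel. A fault address of 0x220 would be consistent with reading a field at that offset through a NULL struct pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption: the real
 * amdgpu/TTM structs carry many more fields and different layouts). */
struct resv { int dummy; };
struct ttm_bo { struct resv *resv; };
struct amdgpu_bo { struct ttm_bo tbo; };
struct vm_bo_base { struct amdgpu_bo *bo; };
struct vm_pt { struct vm_bo_base base; };
struct amdgpu_vm { struct vm_pt root; };

/* Hypothetical helper: reports whether bo shares the reservation object
 * of the VM's root BO. Returns false instead of dereferencing a NULL
 * vm->root.base.bo, which is the pointer suspected in this report. */
static bool bo_shares_root_resv(const struct amdgpu_vm *vm,
                                const struct amdgpu_bo *bo)
{
    const struct amdgpu_bo *root = vm->root.base.bo;

    if (!root)
        return false; /* root BO missing: nothing to compare against */

    return bo->tbo.resv == root->tbo.resv;
}
```

Whether skipping the comparison is the right behaviour in the driver depends on why the root BO is NULL at that point (for example, a VM being torn down while a BO is still being invalidated), which this sketch does not address.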

Comment 1 Bas Nieuwenhuizen 2018-06-20 22:41:50 UTC
I haven't hit this in a long while; it seems to have been fixed for some time.

Comment 2 Christian König 2018-06-21 09:24:44 UTC
OK, in that case let's close this.
