Bug 85207 - agd5f drm-next-3.19-wip + Unreal Elemental sometimes = list_add corruption/hung task
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 88211
Depends on:
Blocks:
 
Reported: 2014-10-19 21:07 UTC by Andy Furniss
Modified: 2015-08-02 11:41 UTC

Attachments
dmesg when Unreal Elemental hangs on start (76.93 KB, text/plain)
2014-10-19 21:07 UTC, Andy Furniss
Possible fix (1.17 KB, patch)
2014-10-21 09:54 UTC, Christian König
Fix for printing the error message (868 bytes, patch)
2015-01-08 15:48 UTC, Christian König

Description Andy Furniss 2014-10-19 21:07:33 UTC
Created attachment 108075
dmesg when Unreal Elemental hangs on start

R9 270X. Sometimes, when running the Unreal Elemental demo, it hangs at startup with the errors in the attached dmesg.

This doesn't always happen.

Mesa is currently at the commit "winsys/radeon: Use a single buffer cache manager again"; the hang was previously produced with a slightly older Mesa as well.

Haven't seen on drm-next-3.18-wip (but really need to test more with current mesa)

Possibly unrelated, but new with drm-next-3.19-wip: I get the messages below when running Unigine Valley, although it runs OK.

Oct 17 11:15:35 ph4 kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0af03504
Oct 17 11:15:35 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010E57
Oct 17 11:15:35 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x10035004
Oct 17 11:15:35 ph4 kernel: VM fault (0x04, vmid 8) at page 69207, read from VGT (53)
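For readers trying to relate the raw register values to the summary line, here is a minimal decoding sketch. The bit layout (protection bits 0-3, memory client id in bits 12-19, read/write in bit 24, VMID in bits 25-28) is an assumption modelled on the radeon SI register definitions, and `decode_vm_fault` is a hypothetical helper, not a driver function. It is checked only against the log lines quoted in this report.

```python
# Assumed field layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS (not taken from
# this report; verified only against the log lines quoted above).
PROTECTIONS_MASK = 0xF                       # bits 0-3: protection fault bits
CLIENT_ID_SHIFT, CLIENT_ID_MASK = 12, 0xFF   # bits 12-19: memory client id
RW_SHIFT = 24                                # bit 24: 0 = read, 1 = write
VMID_SHIFT, VMID_MASK = 25, 0xF              # bits 25-28: faulting VM id


def decode_vm_fault(addr: int, status: int) -> dict:
    """Split the fault address/status pair into the fields the kernel prints."""
    return {
        "protections": status & PROTECTIONS_MASK,
        "client_id": (status >> CLIENT_ID_SHIFT) & CLIENT_ID_MASK,
        "access": "write" if (status >> RW_SHIFT) & 1 else "read",
        "vmid": (status >> VMID_SHIFT) & VMID_MASK,
        "page": addr,  # the fault address register already holds a page number
    }


# Values from the Oct 17 log lines above.
fault = decode_vm_fault(0x00010E57, 0x10035004)
print(fault)
```

Under that assumed layout the result lines up with the kernel's own summary, "VM fault (0x04, vmid 8) at page 69207, read from VGT (53)".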
Comment 1 Andy Furniss 2014-10-19 21:14:23 UTC
I also noticed, in that dmesg and when searching the kernel log, that I sometimes get the following, apparently without any ill effect:

kernel: [drm:radeon_gem_va_update_vm] *ERROR* Couldn't update BO_VA (-512)

With this kernel.
Comment 2 Michel Dänzer 2014-10-20 09:21:16 UTC
(In reply to Andy Furniss from comment #0)
> Haven't seen on drm-next-3.18-wip

Can you bisect the kernel?
Comment 3 Andy Furniss 2014-10-20 17:13:51 UTC
(In reply to Michel Dänzer from comment #2)
> (In reply to Andy Furniss from comment #0)
> > Haven't seen on drm-next-3.18-wip
> 
> Can you bisect the kernel?

It may be a bit early to say, but I will stay on the commit before it for a while to confirm.

Looks like the head commit -

commit bb9a49819ed30f3f5782b2504066547a8507a591
Author: Christian König <christian.koenig@amd.com>
Date:   Mon Oct 13 12:41:47 2014 +0200

    drm/radeon: update the VM after setting BO address
    
    This way the necessary VM update is kicked off immediately
    if all BOs involved are in GPU accessible memory.

So far I haven't managed to lock up, or to get Valley to GPU fault, on the commit before it.

FWIW, I noticed that even on head the Valley fault doesn't always happen. It seems that I need to have set my CPUs to the performance governor (which I nearly always do when testing things like this). With the cpufreq ondemand governor I didn't see the fault.
Comment 4 Christian König 2014-10-21 09:54:23 UTC
Created attachment 108165
Possible fix

Oops! I forgot to take the VM lock in radeon_gem_va_update_vm. A fix is attached.

Thanks for testing,
Christian.
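As a generic illustration of why a missing lock can surface as "list_add corruption", here is a minimal sketch of a doubly linked list whose insert is guarded by a per-VM lock. It is not the attached patch, whose contents are not shown in this report. The names `Node`, `VM`, `mark_invalid` and `invalidated` are hypothetical, and the sanity check in `list_add` only imitates the spirit of the kernel's list debugging.

```python
import threading


class Node:
    def __init__(self, name):
        self.name = name
        # An empty circular list points at itself in both directions.
        self.prev = self.next = self


def list_add(new, head):
    """Insert `new` right after `head`, checking the neighbour link first."""
    nxt = head.next
    if nxt.prev is not head:
        raise RuntimeError("list_add corruption")
    new.prev, new.next = head, nxt
    nxt.prev = new
    head.next = new


class VM:
    """Hypothetical stand-in for a per-process GPU VM with a shared list."""

    def __init__(self):
        self.lock = threading.Lock()
        self.invalidated = Node("head")

    def mark_invalid(self, bo_va):
        # Every path that touches the shared list must hold the same lock.
        with self.lock:
            list_add(bo_va, self.invalidated)


def count_forward(head):
    n, node = 0, head.next
    while node is not head:
        n, node = n + 1, node.next
    return n


def count_backward(head):
    n, node = 0, head.prev
    while node is not head:
        n, node = n + 1, node.prev
    return n


def worker(vm, tag):
    for j in range(1000):
        vm.mark_invalid(Node(f"{tag}-{j}"))


vm = VM()
threads = [threading.Thread(target=worker, args=(vm, f"bo{i}")) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count_forward(vm.invalidated), count_backward(vm.invalidated))
```

With the lock held around every insert, both traversals see all 4000 nodes. A code path that updates the same list without the lock can interleave the four pointer writes with another insert, which is the kind of inconsistency the sanity check would report.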
Comment 5 Andy Furniss 2014-10-21 13:11:09 UTC
(In reply to Christian König from comment #4)
> Created attachment 108165
> Possible fix
> 
> Oops! I forgot to take the VM lock in radeon_gem_va_update_vm. A fix is
> attached.
> 
> Thanks for testing,
> Christian.

I don't know about Elemental as it's far harder to trigger, but first try with valley produced -

[  156.617954] radeon 0000:01:00.0: GPU fault detected: 146 0x02e83504
[  156.617960] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010F17
[  156.617961] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08035004
[  156.617963] VM fault (0x04, vmid 4) at page 69399, read from VGT (53)
Comment 6 Christian König 2014-10-22 13:45:44 UTC
(In reply to Andy Furniss from comment #5)
> I don't know about Elemental as it's far harder to trigger, but first try
> with valley produced -
> 
> [  156.617954] radeon 0000:01:00.0: GPU fault detected: 146 0x02e83504
> [  156.617960] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010F17
> [  156.617961] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08035004
> [  156.617963] VM fault (0x04, vmid 4) at page 69399, read from VGT (53)

Sounds like a different problem triggered by the same patchset to me.

But first things first, is the original issue with the list corruption fixed? If yes we can start to look into this one as well.
Comment 7 Andy Furniss 2014-10-23 09:03:53 UTC
(In reply to Christian König from comment #6)
> (In reply to Andy Furniss from comment #5)
> > I don't know about Elemental as it's far harder to trigger, but first try
> > with valley produced -
> > 
> > [  156.617954] radeon 0000:01:00.0: GPU fault detected: 146 0x02e83504
> > [  156.617960] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010F17
> > [  156.617961] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08035004
> > [  156.617963] VM fault (0x04, vmid 4) at page 69399, read from VGT (53)
> 
> Sounds like a different problem triggered by the same patchset to me.
> 
> But first things first, is the original issue with the list corruption
> fixed? If yes we can start to look into this one as well.

It's OK so far, but then I need more time as I don't really know how to trigger it and last time I called it as OK (in another bug) it wasn't.
Comment 8 Andy Furniss 2014-10-23 22:17:04 UTC
(In reply to Andy Furniss from comment #7)
> (In reply to Christian König from comment #6)
> > (In reply to Andy Furniss from comment #5)
> > > I don't know about Elemental as it's far harder to trigger, but first try
> > > with valley produced -
> > > 
> > > [  156.617954] radeon 0000:01:00.0: GPU fault detected: 146 0x02e83504
> > > [  156.617960] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010F17
> > > [  156.617961] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08035004
> > > [  156.617963] VM fault (0x04, vmid 4) at page 69399, read from VGT (53)
> > 
> > Sounds like a different problem triggered by the same patchset to me.
> > 
> > But first things first, is the original issue with the list corruption
> > fixed? If yes we can start to look into this one as well.
> 
> It's OK so far, but then I need more time as I don't really know how to
> trigger it and last time I called it as OK (in another bug) it wasn't.

I still haven't crashed Elemental, but I have got:

[29066.333908] [drm:radeon_gem_va_update_vm] *ERROR* Couldn't update BO_VA (-512)
[29066.335653] [drm:radeon_gem_va_update_vm] *ERROR* Couldn't update BO_VA (-512)
Comment 9 Andy Furniss 2014-10-30 09:55:05 UTC
(In reply to Christian König from comment #6)

> But first things first, is the original issue with the list corruption
> fixed? If yes we can start to look into this one as well.

Enough time has passed now, so I do think that the patch fixed the list corruption.
Comment 10 Lorenzo Bona 2014-12-30 09:54:34 UTC
I see the same issue here.

[ 1384.901951] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)
[ 1453.198866] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)
[ 2215.773607] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)
[ 2351.238014] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)
[ 3877.903397] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)

Self-compiled kernel from Linus' git tree, currently at 3.19-rc2+.
Comment 11 Michel Dänzer 2015-01-08 02:00:57 UTC
(In reply to Lorenzo Bona from comment #10)
> 
> [ 1384.901951] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)

Christian, any ideas for these? Various people including myself are still hitting them occasionally.
Comment 12 Christian König 2015-01-08 15:48:32 UTC
Created attachment 111961
Fix for printing the error message
Comment 13 Christian König 2015-01-08 15:48:56 UTC
(In reply to Michel Dänzer from comment #11)
> (In reply to Lorenzo Bona from comment #10)
> > 
> > [ 1384.901951] [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-512)
> 
> Christian, any ideas for these? Various people including myself are still
> hitting them occasionally.

Oops, yeah, trivial to fix.
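For context on the -512 in these messages: 512 is the Linux kernel-internal code ERESTARTSYS, which means a system call was interrupted by a signal and should be restarted. That would be consistent with the message appearing "apparently without effect". The attached fix is not reproduced in this report, so the sketch below is only an assumption about its intent: skip the error message when the return code is a restart request. `report_update_result` is a hypothetical helper.

```python
import logging

# Kernel-internal errno value: interrupted by a signal, restart the syscall.
ERESTARTSYS = 512

log = logging.getLogger("radeon-sketch")


def report_update_result(r: int) -> bool:
    """Log a real failure; stay quiet for success and for restart requests.

    Returns True if an error message was logged.
    """
    if r and r != -ERESTARTSYS:
        log.error("Couldn't update BO_VA (%d)", r)
        return True
    return False


for code in (0, -ERESTARTSYS, -12):
    print(code, report_update_result(code))
```

Only the last call logs an error: 0 is success, and -512 is treated as a benign interruption rather than a failure.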
Comment 14 Andy Furniss 2015-07-30 11:49:54 UTC
This should have been closed some time ago.
Comment 15 Christian König 2015-08-02 11:40:04 UTC
*** Bug 88211 has been marked as a duplicate of this bug. ***
Comment 16 Christian König 2015-08-02 11:41:48 UTC
Let's close this.

