Bug 91278 - Tonga GPU lock/reset fail with Unigine Valley
Summary: Tonga GPU lock/reset fail with Unigine Valley
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other
OS: All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 92087
Depends on:
Blocks:
 
Reported: 2015-07-09 10:10 UTC by Andy Furniss
Modified: 2016-09-16 01:45 UTC
CC List: 4 users

See Also:


Attachments
dmesg with lock (79.37 KB, text/plain) - 2015-07-09 10:11 UTC, Andy Furniss
xorg-log for ref (not from lock) (26.08 KB, text/plain) - 2015-07-09 10:12 UTC, Andy Furniss
Kernel log of hang (693.28 KB, text/plain) - 2015-08-17 19:25 UTC, Mathias Tillman
hung task with current agd5f drm-next-4.3 (8.08 KB, text/plain) - 2015-08-28 10:45 UTC, Andy Furniss
apitrace of hang (695.94 KB, text/plain) - 2015-09-25 17:30 UTC, Mathias Tillman
valley vm fault dump (1.78 MB, text/plain) - 2015-10-05 08:54 UTC, Andy Furniss
test kernel patch (1.19 KB, patch) - 2015-10-11 16:14 UTC, Grazvydas Ignotas
journalctl | grep amdgpu (73.11 KB, text/plain) - 2015-11-15 11:24 UTC, Daniele Ruffini

Description Andy Furniss 2015-07-09 10:10:12 UTC
R9 285; kernel is agd5f amdgpu, with or without the patches from

https://bugs.freedesktop.org/show_bug.cgi?id=91141 

mesa is agd5f with a few patches from mainline to build with current llvm.

ddx is git, built against an older xorg.

Simpler games like openarena don't lock.

Valley settings: ultra, 8xAA, fullscreen 1920x1080.

It doesn't show in this log, but I've also seen some

 [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-35)

around the lock/reset on previous tests.
Comment 1 Andy Furniss 2015-07-09 10:11:09 UTC
Created attachment 117012 [details]
dmesg with lock
Comment 2 Andy Furniss 2015-07-09 10:12:56 UTC
Created attachment 117013 [details]
xorg-log for ref (not from lock)
Comment 3 Andy Furniss 2015-07-17 09:30:42 UTC
I've of course tried various things since reporting - Valley doesn't always instantly lock. Unreal 4.5 Elemental got halfway before locking.

Perhaps more interestingly, I managed to trigger a reset/failed resume just browsing - of course I've done a lot of browsing without issue so far. The difference this time was that I had a huge ffmpeg/x265 encode going - it was using all my memory (8 gig, and swap had been used a bit) - so it's possible memory pressure plays a role, or maybe it's just a red herring :-)

I haven't managed to get a reset running timedemos on openarena or xonotic so far - will try with memory pressure as time allows.

The reset when browsing -

[ 8052.101670] amdgpu 0000:01:00.0: GPU lockup (waiting for 0x000000000000f019 last fence id 0x000000000000f018 on ring 9)
[ 8052.101672] amdgpu 0000:01:00.0: failed to sync rings (-35)
[ 8052.108912] amdgpu 0000:01:00.0: Saved 9216 dwords of commands on ring 9.
[ 8052.108929] amdgpu 0000:01:00.0: GPU softreset: 0x00000100
[ 8052.108930] amdgpu 0000:01:00.0:   GRBM_STATUS=0x00003028
[ 8052.108932] amdgpu 0000:01:00.0:   GRBM_STATUS2=0x00000008
[ 8052.108934] amdgpu 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[ 8052.108935] amdgpu 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[ 8052.108937] amdgpu 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[ 8052.108938] amdgpu 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[ 8052.108940] amdgpu 0000:01:00.0:   SRBM_STATUS=0x20020240
[ 8052.108941] amdgpu 0000:01:00.0:   SRBM_STATUS2=0x00000080
[ 8052.108943] amdgpu 0000:01:00.0:   SDMA0_STATUS_REG   = 0x76DEED57
[ 8052.108945] amdgpu 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46DEED57
[ 8052.108946] amdgpu 0000:01:00.0:   CP_STAT = 0x00000000
[ 8052.108948] amdgpu 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000c00
[ 8052.108949] amdgpu 0000:01:00.0:   CP_STALLED_STAT2 = 0x00000000
[ 8052.108951] amdgpu 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
[ 8052.108953] amdgpu 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x00000000
[ 8052.108954] amdgpu 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
[ 8052.108956] amdgpu 0000:01:00.0:   CP_CPF_STATUS = 0x00000000
[ 8052.108957] amdgpu 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000000
[ 8052.108959] amdgpu 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
[ 8052.108961] amdgpu 0000:01:00.0:   CP_CPC_STATUS = 0x00000000
[ 8052.108962] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 8052.108964] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 8052.109078] amdgpu 0000:01:00.0: SRBM_SOFT_RESET=0x00000400
[ 8052.110233] amdgpu 0000:01:00.0:   GRBM_STATUS=0x00003028
[ 8052.110235] amdgpu 0000:01:00.0:   GRBM_STATUS2=0x00000008
[ 8052.110236] amdgpu 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[ 8052.110238] amdgpu 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[ 8052.110239] amdgpu 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[ 8052.110241] amdgpu 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[ 8052.110242] amdgpu 0000:01:00.0:   SRBM_STATUS=0x20020040
[ 8052.110244] amdgpu 0000:01:00.0:   SRBM_STATUS2=0x00000080
[ 8052.110245] amdgpu 0000:01:00.0:   SDMA0_STATUS_REG   = 0x76DEED57
[ 8052.110247] amdgpu 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46DEED57
[ 8052.110248] amdgpu 0000:01:00.0:   CP_STAT = 0x00000000
[ 8052.110250] amdgpu 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000c00
[ 8052.110252] amdgpu 0000:01:00.0:   CP_STALLED_STAT2 = 0x00000000
[ 8052.110253] amdgpu 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
[ 8052.110255] amdgpu 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x00000000
[ 8052.110256] amdgpu 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
[ 8052.110258] amdgpu 0000:01:00.0:   CP_CPF_STATUS = 0x00000000
[ 8052.110259] amdgpu 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000000
[ 8052.110261] amdgpu 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
[ 8052.110262] amdgpu 0000:01:00.0:   CP_CPC_STATUS = 0x00000000
[ 8052.110282] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[ 8052.110289] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[ 8052.111446] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000).
[ 8052.113940] [drm] ring test on 0 succeeded in 10 usecs
[ 8053.856277] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.049187] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.242101] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.435020] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.627925] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.820839] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.013737] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.206669] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.313826] [drm:sdma_v3_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
[ 8055.319862] amdgpu 0000:01:00.0: GPU reset failed
[ 8055.320787] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320806] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 8055.320831] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320841] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 8055.320854] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320863] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
Comment 4 Andy Furniss 2015-07-20 23:25:28 UTC
I tested Xonotic under load over the weekend and it did lock after about 10 minutes - but I then tested without any memory pressure and still managed to lock, though it took longer.

I got another lock while browsing today - there was nothing else happening at the time, but I had compiled llvm and mesa a few minutes earlier.
Comment 5 Mathias Tillman 2015-08-17 19:25:51 UTC
Created attachment 117739 [details]
Kernel log of hang

I have been getting the same hangs, though I get them while just using the computer normally, or even while it's idle.

Using Ubuntu vivid with kernel 4.2-rc7 from Ubuntu mainline, with the oibaf ppa and a self-compiled xf86-video-amdgpu module.
Comment 6 Michel Dänzer 2015-08-18 06:41:45 UTC
(In reply to Mathias Tillman from comment #5)
> Have been getting the same hangs, though I get it while just using the
> computer normally, or even while it was idle.

The symptoms may be similar, but since the circumstances differ, please file your own report.
Comment 7 Andy Furniss 2015-08-28 10:45:34 UTC
Created attachment 117964 [details]
hung task with current agd5f drm-next-4.3

Valley does sometimes get further with newer gits - I have recently got all the way through the scenes. It does still lock, though.

Attached is a hung task trace with current agd5f drm-next-4.3, libdrm, mesa and a recentish llvm.
Comment 8 Alex Deucher 2015-09-24 15:43:44 UTC
*** Bug 92087 has been marked as a duplicate of this bug. ***
Comment 9 Mathias Tillman 2015-09-25 17:30:51 UTC
Created attachment 118448 [details]
apitrace of hang

Not exactly sure how useful it is, but I have attached an excerpt of an apitrace of the unigine valley demo. I had to cut out most of it due to its size - I've run it several times, and the trace has always been >500MB. The last thing in the trace was always glXSwapBuffers, so the excerpt consists of the contents between the last and the next-to-last glXSwapBuffers calls.
Comment 10 Michel Dänzer 2015-09-25 21:52:45 UTC
Mathias, so you can reliably reproduce the hang by replaying that apitrace?
Comment 11 Mathias Tillman 2015-09-28 09:08:52 UTC
(In reply to Michel Dänzer from comment #10)
> Mathias, so you can reliably reproduce the hang by replaying that apitrace?

Unfortunately I haven't been able to replay the apitrace properly - I just get a black screen with a bunch of errors in the output about GLSL 1.50 not being supported, programs not being supported, etc.
I will keep trying though.
Comment 12 Mathias Tillman 2015-09-28 15:35:38 UTC
(In reply to Michel Dänzer from comment #10)
> Mathias, so you can reliably reproduce the hang by replaying that apitrace?

Okay, I have been able to replay the trace by compiling apitrace from source and renaming glretrace to valley_x64.
I can reproduce the hang by replaying, but not in a very useful way, as the hang happens on a different frame on each replay, so I wouldn't really call it reproducible at this point.
I will see what options glretrace gives me, to see if I can find some kind of common denominator between the hangs.
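
For reference, roughly what I did (a sketch from memory - the build steps are apitrace's documented ones, and the trace filename here is just an example):

git clone https://github.com/apitrace/apitrace.git
cd apitrace
cmake -H. -Bbuild
make -C build
# run the replay under Valley's binary name rather than glretrace's own
cp build/glretrace valley_x64
./valley_x64 valley.trace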
Comment 13 Mathias Tillman 2015-09-28 21:27:45 UTC
Sorry for triple posting, but I have some more info that may or may not be useful.
When enabling verbose output from glretrace, I can see that the next-to-last operation before a hang is always glBindBuffer (I've run it 10 times), even though the rest of the output is very different. This, however, only happens when double-buffered visuals are enabled; if I disable them using --sb, the output is the same as above, ending with glXSwapBuffers.
Not sure if this is a coincidence or not, but it seemed interesting to me.
Comment 14 Andy Furniss 2015-09-29 10:32:21 UTC
FWIW I don't think the Valley code/shaders/whatever itself triggers this.

I can run valley for > an hour depending on luck/the state of my box.

One way to get a long run for me is to go into mem sleep, come out, and while nothing else is running, run valley. Based on only a few runs like this, I haven't locked it yet. Randomly starting valley after my PC has been in use all day may lock within 10 seconds.

I am not saying mem sleep cures all locks, it just seems to make them hard to hit. After running valley for an hour one time, I moved on to Unreal Elemental, several runs, no lock. I then tried Unreal Atlantis and eventually got it to hang, though it ran through the scripted bit and I had to start flying around interactively before the hang.

Unreal Atlantis is a bit different/annoying. Different in that it requests more vram than I have, and annoying in that it makes a 1.8 gig cache file under $HOME/.config.

This mem sleep observation may be luck. I also haven't tested some variants yet, like vblank_mode=0 or cpufreq ondemand settings. I run valley all maxed, fullscreen - so yet more variables versus what others may be running it with. A rough sketch of the recipe is below.
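
Sketch of the long-run recipe (the valley directory name is from memory, so treat it as an example):

# suspend to RAM, then wake the box back up
echo mem > /sys/power/state
# with nothing else running, start valley
# (my settings: ultra, 8xAA, fullscreen 1920x1080)
cd Valley-1.0 && ./valley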
Comment 16 Mathias Tillman 2015-09-29 20:33:30 UTC
(In reply to Alex Deucher from comment #15)
> These patches may help:
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150928/302380.html
> http://lists.freedesktop.org/archives/mesa-dev/2015-September/095718.html
Afraid not - I applied both of them and it still hangs. One interesting thing is that I was using mesa from the oibaf ppa, which compiles against llvm 3.6. While using that I haven't been able to replay my apitrace of valley even once - it always hangs before it finishes. However, I compiled mesa against llvm 3.7 (one compiled from source with the patch, and one from llvm's apt repository) and 3.8, and it gets much further now - I've been able to replay the trace three times without a hang, though unfortunately it does still ultimately hang.
Comment 17 Andy Furniss 2015-09-30 09:38:20 UTC
Haven't had time yet to try to hang with the patches.

Yesterday, without them, I hung, rebooted, did the memsleep, then spent the rest of the day trying to lock valley and unreal but couldn't. For the whole day, the only logging I got was a few hundred of -

Sep 29 18:10:47 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72)
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0be84804
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0017557D
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048004
Sep 29 18:10:49 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72)

Last thing yesterday I applied the patches to llvm and mesa gits that are a couple of days old.

This morning I ran valley from a power-off boot after a bit of browsing/mail (yesterday this hung).

Only a quick test, which I stopped; it looked OK, but in dmesg I have >10k of -

[ 1792.292640] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404
[ 1792.292643] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00136F23
[ 1792.292644] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C4004
[ 1792.292646] VM fault (0x04, vmid 3) at page 1273635, read from 'TC4' (0x54433400) (196)
[ 1792.292650] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.292651] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.292652] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292654] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.292658] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404
[ 1792.292659] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.292660] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292661] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.292666] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404
[ 1792.292667] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.292668] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292669] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375515] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404
[ 1792.375518] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00136F23
[ 1792.375519] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084004
[ 1792.375521] VM fault (0x04, vmid 3) at page 1273635, read from 'TC10' (0x54433130) (132)
[ 1792.375526] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.375527] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.375528] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375530] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404
[ 1792.375535] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.375536] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375538] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375542] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404
[ 1792.375543] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1792.375544] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375546] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.432272] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.432276] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00136F23
[ 1792.432277] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06044004
[ 1792.432280] VM fault (0x04, vmid 3) at page 1273635, read from 'TC7' (0x54433700) (68)
Comment 18 Andy Furniss 2015-09-30 10:08:57 UTC
So after the last update I ran valley again briefly and saw a few VM faults, then a longer run and got thousands again.

Without touching anything else I did echo mem >/sys/power/state and then woke up.

A 10 minute run of valley has produced zero faults.
Comment 19 Andy Furniss 2015-09-30 19:59:53 UTC
(In reply to Andy Furniss from comment #18)
> So after last update I ran valley again briefly and saw a few vmfaults, then
> a longer run and got thousands again.
> 
> without touching anything else did echo mem >/sys/power/state and then woke
> up.
> 
> 10 minute run of valley has produced zero faults.

Further test from power off, nothing else running apart from X/fluxbox: short run of valley, no faults. Reran valley for a bit longer and got thousands. Did memsleep, ran valley, no faults, but after about 10 minutes it hung.
Comment 20 Mathias Tillman 2015-09-30 20:42:20 UTC
(In reply to Andy Furniss from comment #19)
> (In reply to Andy Furniss from comment #18)
> > So after last update I ran valley again briefly and saw a few vmfaults, then
> > a longer run and got thousands again.
> > 
> > without touching anything else did echo mem >/sys/power/state and then woke
> > up.
> > 
> > 10 minute run of valley has produced zero faults.
> 
> Further test from power off, nothing else running apart from X/fluxox short
> run of valley no faults. Reran valley for a bit longer and got thousands.
> Did memsleep ran valley no faults but after about 10 minutes it hung.

Do you get those GPU faults in the log even when there's no hang? I haven't checked dmesg while running valley myself, but I do know they always appear when a hang has happened (I'm using ssh to grab dmesg while it's hung).

Dmesg is sometimes completely filled with GPU faults, other times it's just a few. I ran it a few minutes ago and only got this:

[ 1737.984328] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804804
[ 1737.984338] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100110
[ 1737.984343] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048004
[ 1737.984348] VM fault (0x04, vmid 5) at page 1048848, read from 'TC6' (0x54433600) (72)
[ 1737.984355] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804004
[ 1737.984359] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1737.984363] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1737.984366] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1737.984374] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08800804
[ 1737.984378] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1737.984381] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1737.984384] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
Comment 21 Andy Furniss 2015-09-30 21:15:08 UTC
(In reply to Mathias Tillman from comment #20)

> Do you get those GPU faults in the log even when there's no hang?

Yes and I can also hang without getting any.
Comment 22 Mathias Tillman 2015-09-30 21:19:52 UTC
(In reply to Andy Furniss from comment #21)
> (In reply to Mathias Tillman from comment #20)
> 
> > Do you get those GPU faults in the log even when there's no hang?
> 
> Yes and I can also hang without getting any.

Actually, I think I've seen hangs without the GPU faults too now that I think about it. Makes you wonder if the GPU faults are related at all, or if this is something else entirely.
Comment 23 Andy Furniss 2015-09-30 21:51:07 UTC
(In reply to Mathias Tillman from comment #22)
> (In reply to Andy Furniss from comment #21)
> > (In reply to Mathias Tillman from comment #20)
> > 
> > > Do you get those GPU faults in the log even when there's no hang?
> > 
> > Yes and I can also hang without getting any.
> 
> Actually, I think I've seen hangs without the GPU faults too now that I
> think about it. Makes you wonder if the GPU faults are related at all, or if
> this is something else entirely.

Yea, could be unrelated.

I hadn't seen thousands until today, so that aspect could be to do with the patches.
Comment 24 Andy Furniss 2015-09-30 21:55:38 UTC
(In reply to Andy Furniss from comment #23)

> I haven't seen thousands until Today so that aspect could be to do with the
> patches.

Ignore that - more grepping of the kernel log shows I have got thousands before today.
Comment 25 Andy Furniss 2015-10-04 20:34:21 UTC
(In reply to Andy Furniss from comment #24)
> (In reply to Andy Furniss from comment #23)
> 
> > I haven't seen thousands until Today so that aspect could be to do with the
> > patches.
> 
> Ignore that, more grepping of kernel log does show I have got thousands
> before today.

Haven't had time to test thoroughly, but I see the latest update of agd5f's fixes branch has a commit to reduce VM faults, and I haven't got thousands since changing to that.

I also see a new R600_DEBUG=check_vm in mesa. I don't know what, if any, extra info is expected from that, but testing valley with it caused it to quit when it hit a fault after about a minute of running -

Detected a VM fault, exiting...

in dmesg -

[  261.017278] amdgpu 0000:01:00.0: GPU fault detected: 146 0x01e84804
[  261.017290] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0015703D
[  261.017296] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048004
[  261.017302] VM fault (0x04, vmid 3) at page 1404989, read from 'TC6' (0x54433600) (72)
Comment 26 Michel Dänzer 2015-10-05 06:23:35 UTC
(In reply to Andy Furniss from comment #25)
> I also see a new R600_DEBUG=check_vm in mesa. I don't know what, if any,
> extra info is expected from that, but testing valley with it caused it to
> quit when it hit a fault after a about minute of running -

It should generate a file in ~/ddebug_dumps/ with more information about the VM fault.
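
For example (how valley is launched here is just illustrative):

R600_DEBUG=check_vm ./valley
ls ~/ddebug_dumps/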
Comment 27 Andy Furniss 2015-10-05 08:54:37 UTC
Created attachment 118665 [details]
valley vm fault dump

Aha, here it is.
Comment 28 Mathias Tillman 2015-10-05 22:51:56 UTC
(In reply to Andy Furniss from comment #27)
> Created attachment 118665 [details]
> valley vm fault dump
> 
> Aha, here it is.

I am getting similar results using check_vm. However, since the hang usually doesn't happen after the first VM fault (it actually did for me a few times, but most of the time it didn't), I modified the mesa code so the program is no longer closed once a VM fault happens. Unfortunately this doesn't really provide me with any interesting information - the dumps have all complained about the page fault happening in a VERTEX_BUFFER at an identical, or close to identical, address.
Comment 29 Michel Dänzer 2015-10-06 01:25:30 UTC
That is interesting, though; the radeonsi driver seems to think there should be something mapped at the faulting address. This indicates that either the kernel driver fails to handle the mapping properly, or maybe there's a problem with communicating the buffer mapping information from userspace to the kernel driver.
Comment 30 Mathias Tillman 2015-10-09 16:08:23 UTC
Just updated to latest drm-next-4.4-wip and mesa and reran the valley test. Been running for two hours now without a hang (although there have been vm faults). Will test more over the weekend, but it's looking good so far :)
Comment 31 Mathias Tillman 2015-10-10 07:49:20 UTC
Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3) without a hang on the latest drm-next-4.4-wip. Tried the latest drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think there's something in 4.4 that fixes it. I will do a bisect to see if I can figure out what, more exactly, is the fix.
Comment 32 Andy Furniss 2015-10-10 08:59:03 UTC
(In reply to Mathias Tillman from comment #31)
> Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3)
> without a hang on the latest drm-next-4.4-wip. Tried the latest
> drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think
> there's something in 4.4 that fixes it. I will do a bisect to see if I can
> figure out what, more exactly, is the fix.

I'll try over the weekend - no locks so far.

Bisecting would be a pain for me, as I managed some very long lucky runs previously, so I wouldn't easily be able to call a commit good.

I notice that enable_scheduler is now on by default; maybe flipping that would be quicker than a bisect. Long ago I tried some older kernel with it enabled, and though it didn't fix the issue, IIRC it took longer to lock than was normal at that time.
Comment 33 Mathias Tillman 2015-10-10 10:33:39 UTC
(In reply to Andy Furniss from comment #32)
> (In reply to Mathias Tillman from comment #31)
> > Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3)
> > without a hang on the latest drm-next-4.4-wip. Tried the latest
> > drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think
> > there's something in 4.4 that fixes it. I will do a bisect to see if I can
> > figure out what, more exactly, is the fix.
> 
> I'll try over the weekend - no locks so far.
> 
> Bisecting would be a pain for me as I managed to very long lucky runs
> previously, so wouldn't be able to easily call good.
> 
> I notice that enable_scheduler is now on by default, maybe flipping that
> would be quicker than bisect. Long ago I did try some older kernel with that
> enabled and though it didn't fix IIRC it took longer to lock than was normal
> at that time.

Just ran it again with the scheduler disabled (from code, not through the module parameter) on 4.4-wip, and sure enough, after about 30 minutes it hung. So it looks like there's something in the scheduler that makes this either happen less often, or not at all (I will need to confirm this).
Comment 34 Grazvydas Ignotas 2015-10-11 16:14:11 UTC
Created attachment 118824 [details] [review]
test kernel patch

(In reply to Michel Dänzer from comment #29)
> That is interesting, though; the radeonsi driver seems to think there should
> be something mapped at the faulting address. This indicates that either the
> kernel driver fails to handle the mapping properly, or maybe there's a
> problem with communicating the buffer mapping information from userspace to
> the kernel driver.

Judging by the symptoms, it feels like some caching/buffering problem somewhere.

If I understand the code right, most things are mapped write-combine, which means the CPU is allowed to write data in any order it likes. Looking at the amdgpu/radeon code, there is a surprising lack of barriers - basically it's just amdgpu_ring_commit()/radeon_ring_commit() and that's it. But mb() doesn't guarantee that the writes will arrive in program order; it just ensures that all the writes have finished after that mb() statement.

So the question is: is it OK for the hardware if, in something like amdgpu_ib_schedule(), the writes to the ring arrive before the writes to the IB? I admit I don't understand how the hardware works, e.g. what triggers it to start processing the ring contents - perhaps the write to the last word in the ring? If so, you clearly need a wmb() before the write which triggers the hardware, so that everything is ready before the GPU kicks in.

Attached is a debug kernel patch to test whether my guess is correct. It's way overkill and will trash performance, but it should show if this is a problem related to CPU caching/buffering. I don't have the hardware to test this myself.
Comment 35 Christian König 2015-10-11 16:26:58 UTC
The ring and IB are just normal system memory made accessible to the GPU, so it's perfectly fine that it's mapped WC by the CPU.

The processing of the commands written into the ring and IB is triggered by writing the wptr register.

See radeon_ring_set_wptr().
Comment 36 Mathias Tillman 2015-10-11 19:37:00 UTC
I have run the valley test the entire day (10 hours or so) without a single VM fault or hang, with the GPU scheduler enabled on 4.4.

I re-ran the test with your (Grazvydas) patch, and it's showing the same symptoms as before when not using the scheduler - that is, a bunch of VM faults, which lead to a hang after around 30 minutes or so.
Comment 37 Grazvydas Ignotas 2015-10-11 20:50:44 UTC
OK, then my guess is wrong.
Comment 38 Andy Furniss 2015-10-11 21:44:45 UTC
(In reply to Andy Furniss from comment #32)

> I'll try over the weekend - no locks so far.

Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no hangs and no vm faults so far.
Comment 39 Mathias Tillman 2015-10-12 11:58:59 UTC
(In reply to Andy Furniss from comment #38)
> (In reply to Andy Furniss from comment #32)
> 
> > I'll try over the weekend - no locks so far.
> 
> Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no
> hangs and no vm faults so far.

I assume you tried it with the scheduler enabled? I just tried it with the scheduler enabled and semaphores disabled on drm-fixes-4.3, and that also seems to work.

Is the scheduler going to be left enabled from now on? If so, we could probably say that this issue has been solved (though more testing is probably required). However, something is causing all of those VM faults when the scheduler is disabled, so it might still be worth looking into if it's going to remain possible to disable it via a module parameter.
Comment 40 Andy Furniss 2015-10-13 18:24:07 UTC
(In reply to Mathias Tillman from comment #39)
> (In reply to Andy Furniss from comment #38)
> > (In reply to Andy Furniss from comment #32)
> > 
> > > I'll try over the weekend - no locks so far.
> > 
> > Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no
> > hangs and no vm faults so far.
> 
> Assuming you tried it with the scheduler enabled?

Yea, I am running with the default for drm-next-4.4-wip = enabled.
Comment 41 Mathias Tillman 2015-10-15 09:45:23 UTC
Just tried it again with the latest drm-next-4.4 and the kernel parameters amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1. With vm_debug=1 it can't even make it to the beginning of the test - it hangs on the loading screen, if not before that (I have OpenGL acceleration enabled in kwin) - with a bunch of VM faults. If I set vm_debug to 0 it at least starts properly.
Not sure if that was the intended behaviour, but that's what happens in any case.
Comment 42 Alex Deucher 2015-10-15 13:28:22 UTC
(In reply to Mathias Tillman from comment #41)
> Just tried it again with the latest drm-next-4.4 and the kernel parameters
> amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1.

I don't think forcing semaphores will work on tonga/fiji, since they were disabled in the code a while ago due to hw bugs.
Comment 43 Christian König 2015-10-15 13:51:02 UTC
(In reply to Alex Deucher from comment #42)
> (In reply to Mathias Tillman from comment #41)
> > Just tried it again with the latest drm-next-4.4 and the kernel parameters
> > amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1.
> 
> I don't think forcing semaphores will work on tonga/fiji since there were
> disabled in the code a while ago due to hw bugs.

Yeah, forcefully enabling semaphores on Tonga will crash rather fast.

The purpose of vm_debug != 0 is to stop after the first VM fault.

So if you got a VM fault from time to time which was just ignored, then setting vm_debug will certainly crash the system.

On the other hand, you shouldn't get VM faults in the first place.
Comment 44 Mathias Tillman 2015-10-15 15:51:30 UTC
Just done some tests (parameters set on the kernel command line, as sketched below):

enable_scheduler=0, enable_semaphores=1, vm_fault_stop=0, vm_debug=1
Hang after a few minutes, with normal desktop use or running the valley test. Many VM faults in dmesg.

enable_scheduler=0, enable_semaphores=0, vm_fault_stop=1, vm_debug=1
Hang after a few minutes, with normal desktop use or running the valley test. Only one VM fault visible in dmesg - that would be due to vm_fault_stop=1, I'm guessing?

enable_scheduler=0, enable_semaphores=0, vm_fault_stop=0, vm_debug=1
Same as above, but with several VM faults in dmesg.

enable_scheduler=1, enable_semaphores=0, vm_fault_stop=0, vm_debug=1
All good.
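
For reference, each combination above goes on the kernel command line, along these lines (a sketch - adjust for your bootloader; shown for the first combination):

amdgpu.enable_scheduler=0 amdgpu.enable_semaphores=1 amdgpu.vm_fault_stop=0 amdgpu.vm_debug=1

or, if amdgpu is built as a module, via a modprobe options file (the filename here is arbitrary):

echo "options amdgpu enable_scheduler=0 enable_semaphores=1 vm_fault_stop=0 vm_debug=1" > /etc/modprobe.d/amdgpu-debug.conf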
Comment 45 Daniele Ruffini 2015-11-15 11:24:45 UTC
Created attachment 119679 [details]
journalctl | grep amdgpu

It went on for a while; I cut it short because the error was just repeating.
Comment 46 Daniele Ruffini 2015-11-15 11:27:51 UTC
I just forgot to mention that for me the bug happens totally randomly. No Unigine Heaven. No intensive graphics usage. It seems to happen when something new is displayed, though I wasn't able to recreate it methodically.

AMD Radeon HD380 4GB.
Comment 47 Andy Furniss 2015-11-15 15:11:54 UTC
It seems like you are using an old kernel - though what I call old may be current in terms of what's released.

The hanging for me stopped with 4.4, but 4.3 is the current release.

If you are using 4.3, maybe booting with the option -

amdgpu.enable_scheduler=1

will help (a rough grub example is below).

If you are used to compiling your own kernels, then using one from -

http://cgit.freedesktop.org/~agd5f/linux/

like drm-next-4.4, should be stable.

Or if you want to test/use the new powerplay, try amdgpu-powerplay.
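
A rough sketch for grub, assuming that's your bootloader (file paths and the regeneration command vary by distro):

# /etc/default/grub - append the option to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.enable_scheduler=1"
# then regenerate the config
update-grub                              # Debian/Ubuntu
grub-mkconfig -o /boot/grub/grub.cfg     # most other distros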
Comment 48 EoD 2015-12-28 14:24:04 UTC
I don't get any lockups with kernel 4.4-rc6 and current mesa git on my R380X.
Comment 49 Andy Furniss 2016-09-15 17:04:53 UTC
I can't speak for everyone on the cc with different h/w, but on Tonga valley has been stable for a long time, so I'm closing this.

If it's still an issue for anyone, you can reopen.
Comment 50 Michel Dänzer 2016-09-16 01:45:18 UTC
(In reply to Andy Furniss from comment #49)
> If it's still an issue for anyone you can reopen.

This is your report, so anyone else please file their own instead.

