R9 285, using the Unreal ElementalDemo to trigger this.
The problem doesn't start until well into the demo, at the same place that triggered an older, resolved issue
(so maybe Nicolai knows what happens at this point in the demo).
Bisecting LLVM came up with:
c0a189c3792865257c1383f176e5401373ed2270 is the first bad commit
Author: Matthias Braun <firstname.lastname@example.org>
Date: Thu Dec 3 02:05:27 2015 +0000
ScheduleDAGInstrs: Rework schedule graph builder.
The new algorithm remembers the uses encountered while walking backwards
until a matching def is found. Contrary to the previous version this:
- Works without LiveIntervals being available
- Allows to increase the precision to subregisters/lanemasks
(not used for now)
The changes in the AMDGPU tests are necessary because the R600 scheduler
is not stable with respect to the order of nodes in the ready queues.
Differential Revision: http://reviews.llvm.org/D9068
The demo continues to run/render OK, but I get thousands of these:
amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d04401
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092D80FA
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
VM fault (0x01, vmid 5) at page 153977082, read from 'TC7' (0x54433700) (68)
amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d00401
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0022D16B
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C4002
VM fault (0x02, vmid 5) at page 2281835, read from 'TC4' (0x54433400) (196)
amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d04001
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0022D163
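When comparing many of these faults, it can help to pull the fields out of the "VM fault" summary line and relate them to the register dump. The following is a minimal sketch, not part of the original report: the regular expression is modeled only on the lines shown above, and the names `parse_vm_fault` and `decode_client` are hypothetical. It shows that the decimal page number matches the hex value in VM_CONTEXT1_PROTECTION_FAULT_ADDR, and that the client name is simply the ASCII bytes of the hex value printed next to it.

```python
import re

# Pattern modeled on the "VM fault" lines in the log above (an assumption:
# other kernel versions may format this line differently).
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(\w+) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def parse_vm_fault(line):
    """Return the fields of one 'VM fault' line, or None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    _first, vmid, page, access, client, client_raw, _last = m.groups()
    return {
        "vmid": int(vmid),
        "page": int(page),                  # printed in decimal
        "access": access,                   # e.g. "read"
        "client": client,                   # e.g. "TC7"
        "client_raw": int(client_raw, 16),  # e.g. 0x54433700
    }


def decode_client(raw):
    """Interpret a 32-bit value as big-endian ASCII, dropping trailing NULs."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


line = "VM fault (0x01, vmid 5) at page 153977082, read from 'TC7' (0x54433700) (68)"
fault = parse_vm_fault(line)
print(hex(fault["page"]), decode_client(fault["client_raw"]))
```

For the first fault above, the page converts to 0x92d80fa, which is the address reported in the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR line.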
Thanks for the report and the bisection! Interestingly, I was able to reproduce this, but only with the very latest Mesa. So it might actually be some kind of interaction between Mesa and LLVM - or possibly a red herring because it doesn't reproduce reliably? I'll investigate more tomorrow.
I suppose there may be some Mesa version that doesn't do it. I don't update LLVM as often as Mesa, so I could have missed that.
When I noticed this I reset Mesa back to where the old bug fixes went in, as I knew that used to be good. It was still bad. Then I tried older kernels, which were still bad, so I went back to head and started testing LLVM.
On Mesa head the LLVM bisect does seem good. I haven't had much time, but a few runs with LLVM sitting on the bad commit were all bad, and a few on the commit before it were all good. It was only a quick test; I didn't throw cpufreq into the mix. When I saw the result of the bisect I did think red herring, as it wasn't AMD, but I was slightly relieved when AMD got a mention in the commit message.
Time to document some information I've gathered.
I can now confirm that Mesa has nothing to do with it. Something must have gone wrong with my builds initially, sorry for having caused confusion.
I captured an apitrace for better reproducibility and ran it with shader dumps enabled and flushing after each draw call in the "interesting" region.
I am going to attach the R600_DEBUG=check_vm dump which I've cross-referenced with R600_DEBUG=vm to obtain the shaders that were active during the draw call (file names with prefix llvm-c0a189c.mesa-caf12bebd). I then matched the shaders to those dumped by a run with a good version of LLVM (commit just before the bad one, file names with prefix llvm-26ddca1.mesa-caf12bebd).
Clearly, the LLVM changes caused some significant re-ordering of the instruction schedule, and that somehow, surprisingly, seems to be responsible for the VM faults.
Another aspect to note is that the shaders are compiled before draw call 174000, while the VM faults happen shortly after draw call 178000. This seems to suggest that the shaders alone only cause VM faults in conjunction with some other state. However, the VM faults have always happened at exactly the same point so far, so it does appear to be deterministic.
 The demo always causes VM faults, so I'm not going to upload the giant trace; however, the timing varies between runs, so it's cleaner to reproduce using a trace.
Created attachment 120421
Created attachment 120422
This is the vertex shader
Created attachment 120423
Created attachment 120424
Created attachment 120425
Created attachment 120426
Sample extractions of the first four reported VM faults across different runs.
Note how something always wants to access page 0x092D80FA, and later accesses look like they could originate from something being very confused about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C differing only in two bits.
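The "differing only in two bits" observation can be checked by XOR-ing the two page numbers. This is a small illustrative sketch; the helper name `differing_bits` is made up for this example.

```python
def differing_bits(a, b):
    """Return the positions (bit 0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


a, b = 0x00126AAC, 0x001A6A8C
print(hex(a ^ b), differing_bits(a, b))  # prints: 0x80020 [5, 19]
```

The two pages differ in bits 5 and 19 only, which fits the pattern of a mostly valid address with a couple of flipped or misplaced bits rather than a completely random value.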
(In reply to Nicolai Hähnle from comment #9)
> Note how something always wants to access page 0x092D80FA, and later
> accesses look like they could originate from something being very confused
> about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C
> differing only in two bits.
Those symptoms could be the result of incorrect spilling/restoring of register values, causing corrupted VM addresses to be used.
Not sure if it's related, but I get precisely the same error (GPU fault detected: 147) when running a hello world OpenCL program using PyOpenCL.
Probably not, because using the LLVM version just one commit prior to the commit mentioned in comment #1 did not help. I opened a new bug, bug 93374.
This is fixed in LLVM master as of r256072.
*** Bug 93436 has been marked as a duplicate of this bug. ***