| Summary: | Tonga VM Faults since llvm ScheduleDAGInstrs: Rework schedule graph builder. | | |
|---|---|---|---|
| Product: | DRI | Reporter: | Andy Furniss <adf.lists> |
| Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | medium | CC: | nhaehnle, shawn.starr, tstellar, vedran |
| Version: | DRI git | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | | | |
Description
Andy Furniss
2015-12-05 18:24:42 UTC
Thanks for the report and the bisection! Interestingly, I was able to reproduce this, but only with the very latest Mesa. So it might actually be some kind of interaction between Mesa and LLVM - or possibly a red herring, because it doesn't reproduce reliably? I'll investigate more tomorrow.

Maybe there is some Mesa that doesn't do it, I suppose. I don't update LLVM as often as Mesa, so I could have missed that. When I noticed this I reset Mesa back to where the old bug fixes went in, as I knew that used to be good. It was still bad; then I tried older kernels, still bad, so I got back on head and started testing LLVM. On Mesa head the LLVM bisect does seem good - I haven't had much time, but a few runs with LLVM sitting on the bad commit were all bad, and a few on the one before were all good. It was only a quick test - I didn't throw cpufreq into the mix. When I saw the result of the bisect I did think red herring, as it wasn't AMD - but then I was slightly relieved when AMD got a mention in the commit message.

Time to document some information I've gathered. I can now confirm that Mesa has nothing to do with it. Something must have gone wrong with my builds initially, sorry for having caused confusion.

I captured an apitrace for better reproducibility[0] and ran it with shader dumps enabled and flushing after each draw call in the "interesting" region. I am going to attach the R600_DEBUG=check_vm dump, which I've cross-referenced with R600_DEBUG=vm to obtain the shaders that were active during the draw call (file names with prefix llvm-c0a189c.mesa-caf12bebd). I then matched the shaders to those dumped by a run with a good version of LLVM (the commit just before the bad one, file names with prefix llvm-26ddca1.mesa-caf12bebd).

Clearly, the LLVM changes caused some significant re-ordering of the instruction schedule, and that somehow, surprisingly, seems to be responsible for the VM faults.

Another aspect to note is that the shaders are compiled before draw call 174000, while the VM faults happen shortly after draw call 178000. This seems to suggest that the shaders alone only cause VM faults in conjunction with some other state. However, the VM faults have always happened at exactly the same point so far, so it does appear to be deterministic.

[0] The demo always causes VM faults, so I'm not going to upload the giant trace; however, the timing varies between runs, so it's cleaner to reproduce using a trace.

Created attachment 120421 [details]
llvm-c0a189c.mesa-caf12bebd.run008.check_vm-dump
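The replay step described in the comment above could be scripted roughly as in the sketch below. This is only a sketch: the trace filename is a placeholder (the actual trace was not uploaded), the exact R600_DEBUG combination (check_vm on one pass, vm plus vs/ps shader dumps on another) is an assumption based on the flags mentioned in the comment, and it assumes apitrace's glretrace replayer is installed.

```python
import os
import subprocess

# Placeholder trace name; the real apitrace was never attached (see footnote [0]).
TRACE = "demo.trace"

# Pass 1: check_vm makes radeonsi detect VM faults and dump debug info when one occurs.
# Pass 2: vm plus the shader-dump flags print virtual addresses and shader binaries,
#         so the faulting pages can be cross-referenced with the shaders that were bound.
for flags in ("check_vm", "vm,vs,ps"):
    env = dict(os.environ, R600_DEBUG=flags)
    logname = "replay.%s.log" % flags.replace(",", "_")
    with open(logname, "w") as log:
        subprocess.run(["glretrace", TRACE], env=env,
                       stdout=log, stderr=subprocess.STDOUT)
```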
Created attachment 120422 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.113a9b
This is the vertex shader
Created attachment 120423 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.16f00d
Fragment shader
Created attachment 120424 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.vert
Created attachment 120425 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.frag
Created attachment 120426 [details]
dmesg.faults
Sample extractions of the first four reported VM faults across different runs.
Note how something always wants to access page 0x092D80FA, and later accesses look like they could originate from something being very confused about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C differing only in two bits.
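A quick check of the two-bit observation above, using the addresses quoted in that note:

```python
# Verify that the two faulting page addresses differ in exactly two bits.
a = 0x00126AAC
b = 0x001A6A8C
diff = a ^ b
print(hex(diff))             # 0x80020 -> bits 19 and 5
print(bin(diff).count("1"))  # 2
```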
(In reply to Nicolai Hähnle from comment #9)

> Note how something always wants to access page 0x092D80FA, and later
> accesses look like they could originate from something being very confused
> about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C
> differing only in two bits.

Those symptoms could be the result of incorrect spilling/restoring of register values, causing corrupted VM addresses to be used.

Not sure if it's related, but I get precisely the same error (GPU fault detected: 147) when running a hello world OpenCL program using PyOpenCL.

Probably not, because an LLVM version just one commit prior to the commit mentioned in comment #1 did not help. I opened a new bug, bug 93374.

This is fixed in LLVM master as of r256072.

*** Bug 93436 has been marked as a duplicate of this bug. ***
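The "hello world OpenCL program using PyOpenCL" mentioned above was not attached; for context, a generic minimal PyOpenCL program looks roughly like the sketch below (arbitrary kernel and data, not the reporter's code).

```python
import numpy as np
import pyopencl as cl

# Minimal PyOpenCL program: run a trivial kernel that adds 1 to each element.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void inc(__global const float *a, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + 1.0f;
}
""").build()

prg.inc(queue, a.shape, None, a_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(result)
```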