Bug 93264 - Tonga VM Faults since llvm ScheduleDAGInstrs: Rework schedule graph builder.
Summary: Tonga VM Faults since llvm ScheduleDAGInstrs: Rework schedule graph builder.
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 93436
Depends on:
Blocks:
 
Reported: 2015-12-05 18:24 UTC by Andy Furniss
Modified: 2015-12-19 10:38 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:


Attachments
llvm-c0a189c.mesa-caf12bebd.run008.check_vm-dump (81.58 KB, text/plain)
2015-12-08 22:13 UTC, Nicolai Hähnle
llvm-c0a189c.mesa-caf12bebd.run008.shader.113a9b (25.68 KB, text/plain)
2015-12-08 22:14 UTC, Nicolai Hähnle
llvm-c0a189c.mesa-caf12bebd.run008.shader.16f00d (168.82 KB, text/plain)
2015-12-08 22:14 UTC, Nicolai Hähnle
llvm-26ddca1.mesa-caf12bebd.run009.shader.vert (25.74 KB, text/plain)
2015-12-08 22:14 UTC, Nicolai Hähnle
llvm-26ddca1.mesa-caf12bebd.run009.shader.frag (169.86 KB, text/plain)
2015-12-08 22:15 UTC, Nicolai Hähnle
dmesg.faults (5.26 KB, text/plain)
2015-12-08 22:54 UTC, Nicolai Hähnle

Description Andy Furniss 2015-12-05 18:24:42 UTC
On an R9 285, using the Unreal ElementalDemo to trigger this.

It doesn't start until well into the demo, at the same place that triggered an older, now-resolved issue:

https://bugs.freedesktop.org/show_bug.cgi?id=93015
(so maybe Nicolai knows what happens at this point in the demo)

Bisecting LLVM came up with:

c0a189c3792865257c1383f176e5401373ed2270 is the first bad commit
commit c0a189c3792865257c1383f176e5401373ed2270
Author: Matthias Braun <matze@braunis.de>
Date:   Thu Dec 3 02:05:27 2015 +0000

    ScheduleDAGInstrs: Rework schedule graph builder.
    
    The new algorithm remembers the uses encountered while walking backwards
    until a matching def is found. Contrary to the previous version this:
    - Works without LiveIntervals being available
    - Allows to increase the precision to subregisters/lanemasks
      (not used for now)
    
    The changes in the AMDGPU tests are necessary because the R600 scheduler
    is not stable with respect to the order of nodes in the ready queues.
    
    Differential Revision: http://reviews.llvm.org/D9068
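
The commit message describes the new builder's core idea: walk the block backwards, remember uses of each register until the matching def is found, then record def-to-use dependence edges. This is not LLVM's actual code, but a toy Python sketch of that idea (instruction names and the tuple representation here are made up for illustration):

```python
# Toy model of the reworked schedule graph builder described above:
# walk instructions in reverse, remembering uses per register until a
# matching def is encountered, then emit def -> use dependence edges.
def build_deps(instrs):
    # instrs: list of (name, defs, uses) tuples in program order,
    # where defs/uses are sets of register names.
    pending_uses = {}  # reg -> instruction names waiting for their def
    edges = []
    for name, defs, uses in reversed(instrs):
        for reg in defs:
            # This def reaches every use remembered so far for this reg.
            for user in pending_uses.pop(reg, []):
                edges.append((name, user))
        for reg in uses:
            pending_uses.setdefault(reg, []).append(name)
    return edges
```

Because no liveness query is needed, this works without LiveIntervals, matching the first bullet point in the commit message.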


The demo continues to run/render OK, but I get thousands of -

amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d04401
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092D80FA
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
VM fault (0x01, vmid 5) at page 153977082, read from 'TC7' (0x54433700) (68)
amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d00401
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0022D16B
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C4002
VM fault (0x02, vmid 5) at page 2281835, read from 'TC4' (0x54433400) (196)
amdgpu 0000:01:00.0: GPU fault detected: 147 0x07d04001
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0022D163
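For what it's worth, the fault lines above can be decoded by hand: the client ID in parentheses is packed ASCII (0x54433700 is 'T','C','7'), and the decimal page number matches the VM_CONTEXT1_PROTECTION_FAULT_ADDR register value. A small illustrative helper (not kernel code):

```python
# Unpack the packed-ASCII fault client ID printed in the amdgpu log,
# e.g. 0x54433700 -> "TC7" (texture cache unit 7).
def client_name(val):
    chars = []
    for shift in (24, 16, 8, 0):
        b = (val >> shift) & 0xFF
        if b:  # skip NUL padding bytes
            chars.append(chr(b))
    return "".join(chars)

print(client_name(0x54433700))  # TC7
print(0x092D80FA)               # 153977082, the faulting page above
```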
Comment 1 Nicolai Hähnle 2015-12-07 01:16:13 UTC
Thanks for the report and the bisection! Interestingly, I was able to reproduce this, but only with the very latest Mesa. So it might actually be some kind of interaction between Mesa and LLVM - or possibly a red herring because it doesn't reproduce reliably? I'll investigate more tomorrow.
Comment 2 Andy Furniss 2015-12-07 22:32:09 UTC
Maybe there is some Mesa version that doesn't do it, I suppose. I don't update LLVM as often as Mesa, so I could have missed that.

When I noticed this, I reset Mesa back to where the old bug fixes went in, as I knew that used to be good. It was still bad. Then I tried older kernels, still bad, so I went back to head and started testing LLVM.

On Mesa head the LLVM bisect does seem good. I haven't had much time, but a few runs with LLVM sitting on the bad commit were all bad, and a few on the one before were all good. It was only a quick test; I didn't throw cpufreq into the mix. When I saw the result of the bisect I did think red herring, as it wasn't AMD-specific, but then I was slightly relieved when AMD got a mention in the commit message.
Comment 3 Nicolai Hähnle 2015-12-08 22:12:14 UTC
Time to document some information I've gathered.

I can now confirm that Mesa has nothing to do with it. Something must have gone wrong with my builds initially, sorry for having caused confusion.

I captured an apitrace for better reproducibility[0] and ran it with shader dumps enabled and flushing after each draw call in the "interesting" region.

I am going to attach the R600_DEBUG=check_vm dump which I've cross-referenced with R600_DEBUG=vm to obtain the shaders that were active during the draw call (file names with prefix llvm-c0a189c.mesa-caf12bebd). I then matched the shaders to those dumped by a run with a good version of LLVM (commit just before the bad one, file names with prefix llvm-26ddca1.mesa-caf12bebd).
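For reference, the dump runs described above can be reproduced roughly like this (the trace file name is a placeholder; R600_DEBUG=check_vm and R600_DEBUG=vm are the radeonsi debug flags mentioned above, and glretrace is apitrace's replayer):

```shell
# Replay the captured apitrace with VM checking, capturing the dump.
# "elemental.trace" is a placeholder name for the captured trace.
R600_DEBUG=check_vm glretrace elemental.trace &> run.check_vm-dump

# Separate replay with VM map logging, to cross-reference the faulting
# addresses against the shaders active during the draw call.
R600_DEBUG=vm glretrace elemental.trace &> run.vm-dump
```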

Clearly, the LLVM changes caused some significant re-ordering of the instruction schedule, and that somehow, surprisingly, seems to be responsible for the VM faults.

Another aspect to note is that the shaders are compiled before draw call 174000, while the VM faults happen shortly after draw call 178000. This suggests that the shaders alone only cause VM faults in conjunction with some other state. However, the VM faults have always happened at exactly the same point so far, so it does appear to be deterministic.

[0] The demo always causes VM faults, so I'm not going to upload the giant trace; however, the timing varies between runs, so it's cleaner to reproduce using a trace.
Comment 4 Nicolai Hähnle 2015-12-08 22:13:11 UTC
Created attachment 120421 [details]
llvm-c0a189c.mesa-caf12bebd.run008.check_vm-dump
Comment 5 Nicolai Hähnle 2015-12-08 22:14:14 UTC
Created attachment 120422 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.113a9b

This is the vertex shader
Comment 6 Nicolai Hähnle 2015-12-08 22:14:36 UTC
Created attachment 120423 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.16f00d

Fragment shader
Comment 7 Nicolai Hähnle 2015-12-08 22:14:53 UTC
Created attachment 120424 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.vert
Comment 8 Nicolai Hähnle 2015-12-08 22:15:12 UTC
Created attachment 120425 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.frag
Comment 9 Nicolai Hähnle 2015-12-08 22:54:53 UTC
Created attachment 120426 [details]
dmesg.faults

Sample extractions of the first four reported VM faults across different runs.

Note how something always wants to access page 0x092D80FA, and later accesses look like they could originate from something being very confused about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C differing only in two bits.
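The two-bit observation is easy to verify: XORing the two fault addresses leaves exactly two bits set, consistent with a couple of flipped address bits rather than a random pointer.

```python
# The two fault addresses from the dmesg extracts differ in two bits.
diff = 0x00126AAC ^ 0x001A6A8C
print(hex(diff))              # 0x80020
print(bin(diff).count("1"))   # 2
```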
Comment 10 Michel Dänzer 2015-12-09 08:19:38 UTC
(In reply to Nicolai Hähnle from comment #9)
> Note how something always wants to access page 0x092D80FA, and later
> accesses look like they could originate from something being very confused
> about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C
> differing only in two bits.

Those symptoms could be the result of incorrect spilling/restoring of register values, causing corrupted VM addresses to be used.
Comment 11 Vedran Miletić 2015-12-14 14:09:17 UTC
Not sure if it's related, but I get precisely the same error (GPU fault detected: 147) when running a hello world OpenCL program using PyOpenCL.
Comment 12 Vedran Miletić 2015-12-14 18:38:28 UTC
Probably not, because an LLVM version just one commit prior to the commit mentioned in comment #1 did not help. I opened a new bug, bug 93374.
Comment 13 Nicolai Hähnle 2015-12-19 01:41:01 UTC
This is fixed in LLVM master as of r256072.
Comment 14 Shawn Starr 2015-12-19 10:38:39 UTC
*** Bug 93436 has been marked as a duplicate of this bug. ***
