| Summary: | Tonga VM Faults since llvm ScheduleDAGInstrs: Rework schedule graph builder. | | |
|---|---|---|---|
| Product: | DRI | Reporter: | Andy Furniss <adf.lists> |
| Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | medium | CC: | nhaehnle, shawn.starr, tstellar, vedran |
| Version: | DRI git | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | | | |
Description
Andy Furniss
2015-12-05 18:24:42 UTC
Thanks for the report and the bisection! Interestingly, I was able to reproduce this, but only with the very latest Mesa. So it might actually be some kind of interaction between Mesa and LLVM - or possibly a red herring, because it doesn't reproduce reliably? I'll investigate more tomorrow.

Maybe there is some Mesa that doesn't do it, I suppose. I don't update LLVM as often as Mesa, so I could have missed that. When I noticed this I reset Mesa back to where the old bug fixes went in, as I knew that used to be good. It was still bad; then I tried older kernels, still bad, so I got back on head and started testing LLVM. On Mesa head the LLVM bisect does seem good - I haven't had much time, but a few runs with LLVM sitting on the bad commit were all bad, and a few on the one before were all good. It was only a quick test - I didn't throw cpufreq into the mix. When I saw the result of the bisect I did think red herring, as it wasn't AMD - but then I was slightly relieved when AMD got a mention in the commit message.

Time to document some information I've gathered. I can now confirm that Mesa has nothing to do with it. Something must have gone wrong with my builds initially, sorry for having caused confusion.

I captured an apitrace for better reproducibility[0] and ran it with shader dumps enabled and flushing after each draw call in the "interesting" region. I am going to attach the R600_DEBUG=check_vm dump, which I've cross-referenced with R600_DEBUG=vm to obtain the shaders that were active during the draw call (file names with prefix llvm-c0a189c.mesa-caf12bebd). I then matched the shaders to those dumped by a run with a good version of LLVM (the commit just before the bad one, file names with prefix llvm-26ddca1.mesa-caf12bebd).

Clearly, the LLVM changes caused some significant re-ordering of the instruction schedule, and that somehow, surprisingly, seems to be responsible for the VM faults.

Another aspect to note is that the shaders are compiled before draw call 174000, while the VM faults happen shortly after draw call 178000. This seems to suggest that the shaders alone only cause VM faults in conjunction with some other state. However, the VM faults have always happened at exactly the same point so far, so it does appear to be deterministic.

[0] The demo always causes VM faults, so I'm not going to upload the giant trace; however, the timing varies between runs, so it's cleaner to reproduce using a trace.

Created attachment 120421 [details]
llvm-c0a189c.mesa-caf12bebd.run008.check_vm-dump
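The replay step described in the comment above could be scripted roughly as in the sketch below. This is only a sketch: the trace filename is a placeholder (the actual trace was not uploaded), the exact R600_DEBUG combination (check_vm on one pass, vm plus vs/ps shader dumps on another) is an assumption based on the flags mentioned in the comment, and it assumes apitrace's glretrace replayer is installed.

```python
import os
import subprocess

# Placeholder trace name; the real apitrace was never attached (see footnote [0]).
TRACE = "demo.trace"

# Pass 1: check_vm makes radeonsi detect VM faults and dump debug info when one occurs.
# Pass 2: vm plus the shader-dump flags print virtual addresses and shader binaries,
#         so the faulting pages can be cross-referenced with the shaders that were bound.
for flags in ("check_vm", "vm,vs,ps"):
    env = dict(os.environ, R600_DEBUG=flags)
    logname = "replay.%s.log" % flags.replace(",", "_")
    with open(logname, "w") as log:
        subprocess.run(["glretrace", TRACE], env=env,
                       stdout=log, stderr=subprocess.STDOUT)
```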
Created attachment 120422 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.113a9b
This is the vertex shader
Created attachment 120423 [details]
llvm-c0a189c.mesa-caf12bebd.run008.shader.16f00d
Fragment shader
Created attachment 120424 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.vert
Created attachment 120425 [details]
llvm-26ddca1.mesa-caf12bebd.run009.shader.frag
Created attachment 120426 [details]
dmesg.faults
Sample extractions of the first four reported VM faults across different runs.
Note how something always wants to access page 0x092D80FA, and later accesses look like they could originate from something being very confused about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C differing only in two bits.
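A quick check of the two-bit observation above, using the addresses quoted in that note:

```python
# Verify that the two faulting page addresses differ in exactly two bits.
a = 0x00126AAC
b = 0x001A6A8C
diff = a ^ b
print(hex(diff))             # 0x80020 -> bits 19 and 5
print(bin(diff).count("1"))  # 2
```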
(In reply to Nicolai Hähnle from comment #9)

> Note how something always wants to access page 0x092D80FA, and later
> accesses look like they could originate from something being very confused
> about the memory layout of textures, e.g. 0x00126AAC and 0x001A6A8C
> differing only in two bits.

Those symptoms could be the result of incorrect spilling/restoring of register values, causing corrupted VM addresses to be used.

Not sure if it's related, but I get precisely the same error (GPU fault detected: 147) when running a hello world OpenCL program using PyOpenCL.

Probably not, because an LLVM version just one commit prior to the commit mentioned in comment #1 did not help. I opened a new bug, bug 93374.

This is fixed in LLVM master as of r256072.

*** Bug 93436 has been marked as a duplicate of this bug. ***
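The "hello world OpenCL program using PyOpenCL" mentioned above was not attached; for context, a generic minimal PyOpenCL program looks roughly like the sketch below (arbitrary kernel and data, not the reporter's code).

```python
import numpy as np
import pyopencl as cl

# Minimal PyOpenCL program: run a trivial kernel that adds 1 to each element.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void inc(__global const float *a, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + 1.0f;
}
""").build()

prg.inc(queue, a.shape, None, a_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(result)
```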